I have a dataset of 40,000 samples, each a large numpy array of shape (5000, 12). The source data is 45,150 .hea and .mat files that I've already read into those arrays, with the labels stored as a 63-element multi-hot numpy array per sample. The problem is that the whole thing doesn't fit in my RAM, so I need advice on how to save the data without having to loop over the files again, and then how to load it back efficiently to fit a model. I've tried saving to CSV, but that loses the array structure, and pandas wasn't helpful either since I couldn't write the data to Parquet. Every format I've tried ends up consuming too much memory (around 20 GB) and crashes the process. Any suggestions?
2 Answers
You should try chunking your data. Instead of saving and loading everything as one object, split the 40,000 samples into shards of a few hundred to a few thousand samples each, save every shard as its own file, and load them one at a time while training. Peak memory then stays at the size of a single shard rather than the full ~20 GB dataset, which should stop the crashes; see the sketch below.
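A minimal sketch of that idea, assuming the samples and labels can be handed over in order (the names `all_samples`, `all_labels`, `CHUNK_SIZE`, and the shard directory are hypothetical, and the float32/int8 dtypes are just a space-saving suggestion):

```python
import glob
import os

import numpy as np

CHUNK_SIZE = 1000  # samples per shard; tune so one shard fits comfortably in RAM


def save_in_chunks(all_samples, all_labels, out_dir="shards"):
    """Write the dataset as many small .npz shards instead of one huge file."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(all_samples), CHUNK_SIZE):
        x = np.asarray(all_samples[i:i + CHUNK_SIZE], dtype=np.float32)  # (n, 5000, 12)
        y = np.asarray(all_labels[i:i + CHUNK_SIZE], dtype=np.int8)      # (n, 63)
        np.savez_compressed(f"{out_dir}/chunk_{i // CHUNK_SIZE:04d}.npz", x=x, y=y)


def iter_chunks(out_dir="shards"):
    """Yield one shard at a time, so only one shard is ever in memory."""
    for path in sorted(glob.glob(f"{out_dir}/chunk_*.npz")):
        with np.load(path) as f:
            yield f["x"], f["y"]


# Training loop sketch: feed shards to the model one by one.
# for x_batch, y_batch in iter_chunks():
#     model.train_on_batch(x_batch, y_batch)  # hypothetical Keras-style call
```

If your framework supports generators directly (e.g. `tf.data.Dataset.from_generator` or a PyTorch `IterableDataset`), you can plug `iter_chunks` straight into it instead of writing the loop yourself.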
Streaming is another option: keep the whole dataset in a single on-disk file in a format that supports lazy slicing, and read it piece by piece instead of loading everything into memory. HDF5 (via PyTables or h5py), numpy memory-mapped files, and Dask arrays all work this way, so only the batch you are currently training on ever sits in RAM.
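Here is a sketch of the HDF5 route using h5py (an HDF5 front end alongside PyTables). The array shapes follow the question; `my_sample_iterator()` and the file name are hypothetical placeholders for however you currently produce one sample at a time:

```python
import h5py
import numpy as np

N_SAMPLES, SAMPLE_SHAPE, N_LABELS = 40_000, (5000, 12), 63

# One-time write: append samples as they are produced, never holding all 40k in RAM.
with h5py.File("ecg_dataset.h5", "w") as f:
    x_ds = f.create_dataset("x", shape=(N_SAMPLES, *SAMPLE_SHAPE),
                            dtype="float32", chunks=(64, *SAMPLE_SHAPE))
    y_ds = f.create_dataset("y", shape=(N_SAMPLES, N_LABELS), dtype="int8")
    for i, (sample, label) in enumerate(my_sample_iterator()):  # hypothetical generator
        x_ds[i] = sample
        y_ds[i] = label

# Lazy reads at training time: only the requested slice is pulled from disk.
with h5py.File("ecg_dataset.h5", "r") as f:
    for start in range(0, N_SAMPLES, 256):
        x_batch = f["x"][start:start + 256]   # numpy array, shape (<=256, 5000, 12)
        y_batch = f["y"][start:start + 256]
        # model.train_on_batch(x_batch, y_batch)  # hypothetical training call
```

The `chunks` argument tells HDF5 to store the data in small blocks, which is what makes slicing a batch cheap; a pure-numpy alternative with the same access pattern is `np.memmap` or `np.lib.format.open_memmap` if you'd rather avoid an extra dependency.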