Tips for Handling Large Datasets Too Big for RAM

Asked By TechyExplorer99 On

I've got a massive dataset of about 40,000 samples, where each sample is a large numpy array of shape (5000, 12). The data comes from 45,150 .hea and .mat files that I've already read into those arrays, and the labels are 63-element multihot numpy arrays. The problem is that the full dataset doesn't fit in my RAM, so I need advice on how to save this data without having to loop through the files again, and on how to load it efficiently to fit a model. I've tried saving to CSV, but it loses data, and pandas wasn't helpful either since I couldn't save to parquet. Every format I've tried ends up consuming too much memory (around 20GB), causing crashes. Any suggestions?

2 Answers

Answered By DataCruncher82 On

You should try chunking your data. Instead of loading everything at once, process it in smaller batches: save each batch to its own file and load the batches one at a time when you train your model. That keeps memory usage bounded and prevents the crashes.
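Here is a minimal sketch of that idea with plain numpy, assuming you can iterate over your samples once more; `iter_samples()`, the `chunks/` directory, the chunk size, and the `model` in the comment are all placeholders to adapt to your setup:

```python
import os
import numpy as np

os.makedirs("chunks", exist_ok=True)

CHUNK = 1000  # samples per file; pick a size that comfortably fits in RAM

def flush(buf_X, buf_y, chunk_id):
    # Write one chunk of samples/labels to its own pair of .npy files.
    np.save(f"chunks/X_{chunk_id:03d}.npy", np.stack(buf_X))
    np.save(f"chunks/y_{chunk_id:03d}.npy", np.stack(buf_y))

# iter_samples() is a placeholder for however you get one (sample, label)
# pair at a time, e.g. re-reading the .mat/.hea files or iterating over
# whatever you already hold in memory.
buf_X, buf_y, chunk_id = [], [], 0
for sample, label in iter_samples():
    buf_X.append(sample.astype(np.float32))  # float32 halves the size vs float64
    buf_y.append(label.astype(np.uint8))
    if len(buf_X) == CHUNK:
        flush(buf_X, buf_y, chunk_id)
        buf_X, buf_y, chunk_id = [], [], chunk_id + 1
if buf_X:  # write the final partial chunk
    flush(buf_X, buf_y, chunk_id)
    chunk_id += 1

# Training side: load one chunk at a time instead of the whole dataset.
def chunk_batches(n_chunks):
    for c in range(n_chunks):
        yield np.load(f"chunks/X_{c:03d}.npy"), np.load(f"chunks/y_{c:03d}.npy")

# for X_batch, y_batch in chunk_batches(chunk_id):
#     model.train_on_batch(X_batch, y_batch)  # or one fit(...) call per chunk
```

Casting to float32 alone roughly halves the ~20GB footprint of float64 arrays, which can by itself be the difference between fitting in memory or not.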

Answered By CodeWizard21 On

Streaming is another option. Rather than loading the whole dataset into memory, write it to an on-disk format you can read piece by piece, so only the slice you're currently working on ever touches RAM. Libraries like Dask (lazy, chunked arrays) or PyTables (HDF5 arrays you can append to and slice) are built for exactly this.
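As a rough sketch of the PyTables route: append samples to an extendable HDF5 array, then read it back in slices. `iter_samples()`, the file name, and the batch size below are placeholders for your own setup:

```python
import numpy as np
import tables

# Write incrementally: an EArray lives on disk and grows one row at a time,
# so the full ~20GB never has to sit in RAM at once.
with tables.open_file("dataset.h5", mode="w") as f:
    X = f.create_earray(f.root, "X", tables.Float32Atom(), shape=(0, 5000, 12))
    y = f.create_earray(f.root, "y", tables.UInt8Atom(), shape=(0, 63))
    for sample, label in iter_samples():  # placeholder per-sample iterator
        X.append(sample[np.newaxis].astype(np.float32))
        y.append(label[np.newaxis].astype(np.uint8))

# Read back in slices: only the rows you index are loaded into memory.
with tables.open_file("dataset.h5", mode="r") as f:
    for start in range(0, f.root.X.nrows, 32):
        X_batch = f.root.X[start:start + 32]  # numpy array, shape (<=32, 5000, 12)
        y_batch = f.root.y[start:start + 32]
        # feed X_batch, y_batch to your model here
```

If you'd rather keep numpy-style indexing over the whole dataset, dask.array can wrap an on-disk HDF5 dataset like this one into a single lazy array and only pull chunks into memory as they're needed.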
