I'm working with a ton of high-resolution GPS data files that take up about 125GB compressed, but uncompressed they might exceed 1TB. I've developed a Python script that processes each file one at a time and generates a numpy array for each file; these arrays vary in length. After processing a file, I need to save the resulting array to disk, appending each new array to what's already there, so I end up with one ever-growing 1D array stored on my hard drive.
For instance, if I generate an array like [1,2,3,4] from the first file, and [5,6,7] from the second, I want my final file to look like [1,2,3,4,5,6,7]. The challenge here is that I anticipate the final array could contain over 10 billion floats, which means it would take up about 40GB of storage. However, I only have 16GB of RAM, so loading everything into memory to write it out at once isn't an option. I'm curious about how others handle this situation.
7 Answers
You might want to use the zipfile module to stream your compressed input data rather than extracting each file fully. Only load a small portion into memory at a time and append to your output file as needed. Check out the documentation [here](https://docs.python.org/3/library/zipfile.html) for more details!
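For what it's worth, here's a minimal sketch of that kind of streaming read, assuming the inputs live in a zip archive (the archive name, output path, and chunk size are placeholders):

```python
import zipfile

ARCHIVE = "gps_data.zip"   # hypothetical archive name
CHUNK_SIZE = 1024 * 1024   # read roughly 1 MiB at a time

with zipfile.ZipFile(ARCHIVE) as zf, open("decoded_output.bin", "ab") as out:
    for name in zf.namelist():
        # ZipFile.open() returns a file-like object, so each member is
        # decompressed lazily as you read from it.
        with zf.open(name) as member:
            while True:
                chunk = member.read(CHUNK_SIZE)
                if not chunk:
                    break
                out.write(chunk)  # or parse the chunk and write results instead
```

Nothing bigger than one chunk is ever held in memory, no matter how large the archive is.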
It might be easiest just to write each row out as you finish processing it. That way, you avoid hitting RAM limits altogether!
Appending to files is pretty straightforward! If you want to save disk space, consider streaming your output through a compression tool like gzip: just write the data to stdout and pipe it straight into gzip as you go. Also, think about breaking your large dataset into smaller files; it'll make processing easier later on. And don't forget to plan for crashes; you'll want a way to resume if something goes wrong!
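A rough sketch of the stdout-to-gzip idea, assuming the script just prints one value per line (the function name and sample data are made up):

```python
import sys
import numpy as np

def emit(arr: np.ndarray) -> None:
    # Print one value per line; compression happens downstream in gzip,
    # so nothing beyond the current array is held in memory here.
    for value in arr:
        sys.stdout.write(f"{value}\n")

if __name__ == "__main__":
    emit(np.array([1.0, 2.0, 3.0, 4.0]))
    emit(np.array([5.0, 6.0, 7.0]))
```

You'd then run it as something like `python process.py | gzip > output.txt.gz`, so the compressed output file grows as the script runs.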
You could append each new set of numbers directly to your file as you process them. This way, the RAM usage stays low since you're only holding onto the numbers from one file at a time. I'm not a Python guru, but I believe this approach should work just fine!
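Here's a minimal sketch of that in Python with numpy; the output path and the float32 dtype are assumptions, so adjust them to match your data:

```python
import numpy as np

OUTPUT = "combined.f32"  # hypothetical path for the growing 1D array

def append_array(arr: np.ndarray) -> None:
    # Open in append-binary mode so each call adds to the end of the file;
    # only the array currently being processed is held in memory.
    with open(OUTPUT, "ab") as f:
        arr.astype(np.float32).tofile(f)

append_array(np.array([1, 2, 3, 4]))
append_array(np.array([5, 6, 7]))
# The whole result can later be read back (or memory-mapped) as one array:
print(np.fromfile(OUTPUT, dtype=np.float32))  # [1. 2. 3. 4. 5. 6. 7.]
```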
Just be careful with batch sizes! It's often faster to write in chunks, sized appropriately for your disk's cache. If your hard drive has enough space, go ahead and write directly to the file without worrying too much about RAM.
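One way to batch the writes, sketched with an arbitrary buffer size you'd tune for your own disk (the class and parameter names are invented for illustration):

```python
import numpy as np

class BatchedWriter:
    """Buffer small arrays in memory and flush them to disk in larger chunks."""

    def __init__(self, path: str, flush_elements: int = 4_000_000):
        self.path = path                      # output file, opened in append mode
        self.flush_elements = flush_elements  # ~16 MB of float32 per flush
        self.pending = []
        self.count = 0

    def append(self, arr: np.ndarray) -> None:
        self.pending.append(np.asarray(arr, dtype=np.float32))
        self.count += arr.size
        if self.count >= self.flush_elements:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            with open(self.path, "ab") as f:
                np.concatenate(self.pending).tofile(f)
            self.pending.clear()
            self.count = 0
```

Call `flush()` once more at the end of the run so the tail of the buffer isn't lost.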
Do you strictly need to use Python? You might have more control over memory management with a different programming language if you're open to it.
Using SQLite could simplify things a lot. It provides an easy way to handle data without getting overwhelmed. You can write each new array as you go, which would keep your memory usage down!
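A small sketch of that with the standard-library sqlite3 module; the database file, table, and column names here are made up:

```python
import sqlite3
import numpy as np

conn = sqlite3.connect("gps_values.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS samples (value REAL)")

def append_array(arr: np.ndarray) -> None:
    # executemany streams the rows in; only the current array is in memory.
    conn.executemany(
        "INSERT INTO samples (value) VALUES (?)",
        ((float(v),) for v in arr),
    )
    conn.commit()

append_array(np.array([1.0, 2.0, 3.0, 4.0]))
append_array(np.array([5.0, 6.0, 7.0]))
```

You also get indexing and partial queries later, though the storage overhead will likely be higher than a raw binary file.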
Think about how you'll retrieve the data. If you eventually need the entire array in memory, you might want to consider memory-mapped files. But really, it sounds like a database could serve you better here, depending on your needs.
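If you do eventually need array-style access to the whole result, here's a hedged sketch with np.memmap, assuming the data was written to disk as raw float32 (as in the append example above):

```python
import numpy as np

# Map the file without loading it; pages are pulled in from disk on demand.
data = np.memmap("combined.f32", dtype=np.float32, mode="r")

print(data.shape)         # one long 1D array covering the whole file
print(data[-1])           # random access without reading all 40 GB
print(data[:100].mean())  # slices only touch the pages they need
```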
Absolutely! You can use the Pandas library to append a DataFrame to a CSV file like this: `x.to_csv("filepath/filename.csv", mode='a')`. Just pass `header=False` on the appended writes, or you'll get header rows repeated in the middle of your data.
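A minimal sketch of that pattern (the path is a placeholder, and keep in mind CSV will be far bulkier than raw binary at this scale):

```python
import numpy as np
import pandas as pd

OUT = "combined.csv"  # placeholder path

def append_array(arr: np.ndarray, first: bool) -> None:
    # Write the header only on the first call, then append without it.
    pd.DataFrame({"value": arr}).to_csv(
        OUT, mode="w" if first else "a", header=first, index=False
    )

append_array(np.array([1, 2, 3, 4]), first=True)
append_array(np.array([5, 6, 7]), first=False)
```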