I'm working on a data processing application that's part of an ETL pipeline, and it's implemented in C#. Currently, I sort large datasets containing millions of records using this approach:
1. I write the unsorted records to a binary file on disk.
2. I keep track of sort keys and their offsets in memory. If there are too many, I sort them in manageable chunks in memory before writing those to disk.
3. Finally, I perform a k-way merge sort on the sorted chunks by reading each sorted key's offset to access the full record from the binary file.
This method effectively handles large amounts of data since it doesn't require sorting everything in memory at once. However, it feels overly complicated, so I'm looking for ways to optimize it. One thought was to use a database for sorting instead, but when I tried SQLite, it seemed to slow down the process instead of helping. Any suggestions for improving the sorting efficiency would be greatly appreciated!
6 Answers
Your current implementation sounds pretty strong, but the key to optimization lies in identifying which system resource—CPU, memory, or I/O—is being maxed out now. As you scale the input, pay attention to which area is slowing down first. For example, inserting records into an SQLite DB tends to be I/O intensive and can limit performance, especially if it's single-threaded.
Using a SQL database could streamline your process. By storing the records there and creating an index on the key fields, you could leverage the database's optimization capabilities for sorting. For example, with Postgres, you can store binary blobs, which could be very efficient.
How large are your datasets in relation to your available RAM? If the total size of your data (like 5 GB) fits comfortably in RAM, you could simplify your approach by loading everything into memory. However, if the data exceeds your RAM, your current method isn't unreasonable, but I suspect reading records based on their offsets could be your biggest bottleneck. Try profiling your algorithm: measuring how long each part takes could help identify what's slowing things down.
The app needs to work for users with various system specs, so they might not have much RAM. Also, users can end up with datasets that reach several GBs because of how the app functions.
You might want to consider using Redis if your records are relatively small. It's an in-memory cache and database that’s optimized for speed, making it a solid option for quick sorting operations.
Merge sort is a solid approach here, but it can be quite memory-intensive. Have you considered if your records fit in memory? You can likely handle gigabytes without too much trouble. It may often be beneficial to keep data in memory to reduce disk I/O, which can drastically speed things up.
Instead of explaining the whole algorithm, just tell us the name of the sorting algorithm you used; it could save time and effort for everyone.

I tried that with SQLite, thinking it would alleviate some work, but it ended up being slower than my manual sorting method. I prefer to keep the app lightweight, so shipping it with a full SQL server feels like overkill.