I'm working on a project that requires counting occurrences of billions of items and their properties, but I'm running into performance problems because the dataset doesn't fit in RAM. My initial approach was to count as much as possible in memory and then upsert the partial results into SQLite, but that's painfully slow and would take months to finish. I switched to DuckDB hoping for better performance, but it was even slower. I've also tried pruning the data with a probability distribution and a bloom filter, but neither gave a meaningful speedup. I'm now considering partitioning the data so each chunk fits in RAM and making multiple passes, but I wanted to see if anyone has better suggestions or tips!
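For what it's worth, here is a rough sketch of the partition-then-count idea I'm considering. The file names, the tab-separated `item<TAB>property` layout, and the partition count are all placeholders, not my actual data:

```python
import os
import zlib
from collections import Counter

NUM_PARTITIONS = 256  # chosen so one partition's counts fit comfortably in RAM

def partition_by_item(input_path, spill_dir):
    """Pass 1: route every record to a spill file based on a stable hash of the item."""
    os.makedirs(spill_dir, exist_ok=True)
    spills = [open(os.path.join(spill_dir, f"part_{i:03d}.tsv"), "w")
              for i in range(NUM_PARTITIONS)]
    try:
        with open(input_path) as f:
            for line in f:
                item = line.split("\t", 1)[0]
                idx = zlib.crc32(item.encode("utf8")) % NUM_PARTITIONS
                spills[idx].write(line)
    finally:
        for s in spills:
            s.close()

def count_partition(path):
    """Pass 2: a partition holds all records for its items, so exact counting is safe."""
    item_counts = Counter()
    pair_counts = Counter()
    with open(path) as f:
        for line in f:
            item, prop = line.rstrip("\n").split("\t", 1)
            item_counts[item] += 1
            pair_counts[(item, prop)] += 1
    return item_counts, pair_counts

partition_by_item("records.tsv", "spill")
for name in sorted(os.listdir("spill")):
    item_counts, pair_counts = count_partition(os.path.join("spill", name))
    # ...write or upsert the exact counts for this partition here...
```

Because the hash is keyed on the item, every occurrence of a given item lands in the same spill file, so the per-partition counts are exact without ever holding the whole dataset in memory.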
1 Answer
You might want to look into the HyperLogLog algorithm if you can tolerate approximate results. It estimates the number of distinct items in a stream using only a few kilobytes of memory, and it's in the same family of probabilistic data structures as your bloom filter, so it fits the same memory-reduction idea. Just keep in mind that scanning all that data is going to be slow regardless.
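If approximate distinct counts are enough, something like this is a rough sketch using the third-party datasketch package; the file name and record layout are assumptions on my part:

```python
from datasketch import HyperLogLog  # third-party: pip install datasketch

# p=14 gives 2**14 registers, roughly 16 KB of state and about 0.8% relative error.
hll = HyperLogLog(p=14)

with open("records.tsv") as f:
    for line in f:
        item = line.split("\t", 1)[0]
        hll.update(item.encode("utf8"))

print(f"approximate number of distinct items: {hll.count():.0f}")
```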
To clarify, I still need precise counts per item. For example, item A appears 100 times, and 50 of those have property Z. I’m using the bloom filter primarily to eliminate unlikely candidates from the dataset to keep it statistically relevant.
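In case it helps, this is roughly the kind of prefilter I mean, sketched with only the standard library; the candidate list, sizes, and record format below are placeholders for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter over a bytearray. num_hashes must be <= 8 because the
    bit positions are carved out of a single 32-byte SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the items I actually care about (placeholder list),
# then drop records whose item definitely isn't a candidate. False positives
# let a few extras through, but no real candidate is ever discarded.
candidates = ["item_a", "item_b", "item_c"]
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
for c in candidates:
    bf.add(c)

def keep(line):
    return line.split("\t", 1)[0] in bf
```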