How can I efficiently count billions of items when they won’t fit in RAM?

Asked By TechWhiz42

I'm working on a project that requires counting occurrences of billions of items and their properties, but I'm running into performance problems because the dataset doesn't fit in RAM. My initial approach was to count as much as I could in RAM and then upsert the results into SQLite, but that is painfully slow and would take months to finish. I switched to DuckDB hoping for better performance, but it was even slower. I've also tried using a probability distribution and a Bloom filter to prune the data, but I'm still not seeing any significant improvement. I'm now considering partitioning the data so each chunk fits into RAM and making multiple passes, but I wanted to see if anyone has better suggestions or tips!
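
For reference, here is a minimal sketch of the partition-and-multi-pass idea mentioned above: pass 1 routes each record to one of N on-disk partition files using a stable hash of its item key, and pass 2 counts each partition entirely in memory, one partition at a time. The tab-separated "item, property" record format, the file names, and the partition count are assumptions for illustration, not the poster's actual data layout.

```python
import os
import zlib
from collections import Counter

NUM_PARTITIONS = 256  # tune so each partition's distinct keys fit comfortably in RAM

def partition_pass(input_path, work_dir):
    """Pass 1: route every record to a partition file chosen by a stable hash of its key."""
    os.makedirs(work_dir, exist_ok=True)
    parts = [open(os.path.join(work_dir, f"part_{i:03d}.tsv"), "w")
             for i in range(NUM_PARTITIONS)]
    with open(input_path) as src:
        for line in src:
            item, _prop = line.rstrip("\n").split("\t")   # assumed format: item<TAB>property
            parts[zlib.crc32(item.encode()) % NUM_PARTITIONS].write(line)
    for f in parts:
        f.close()

def count_pass(work_dir):
    """Pass 2: count each partition fully in memory; every key lives in exactly one partition."""
    for i in range(NUM_PARTITIONS):
        item_counts = Counter()     # item -> total occurrences
        prop_counts = Counter()     # (item, property) -> occurrences
        with open(os.path.join(work_dir, f"part_{i:03d}.tsv")) as part:
            for line in part:
                item, prop = line.rstrip("\n").split("\t")
                item_counts[item] += 1
                prop_counts[(item, prop)] += 1
        yield i, item_counts, prop_counts   # persist or merge these however suits the pipeline
```

Because every occurrence of a given key lands in the same partition, the per-partition counts are exact and never need to be merged across partitions, which avoids the upsert bottleneck entirely.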

1 Answer

Answered By DataCruncherX

You might want to look into the HyperLogLog algorithm if you can tolerate approximate counts. It's a memory-efficient sketch for estimating how many distinct items you've seen, and it's somewhat related to your Bloom filter idea in that both trade exactness for a small memory footprint. Just keep in mind that processing that much data is going to be slow regardless.
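
As a concrete illustration of the suggestion above, here is a minimal, self-contained HyperLogLog sketch. It estimates how many distinct items a stream contains (not how often each one occurs); the register count `p=14` and the BLAKE2 hash are illustrative choices, not anything from the thread.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: estimates the number of DISTINCT items seen."""

    def __init__(self, p=14):
        self.p = p                  # 2**p registers; relative error is roughly 1.04 / sqrt(2**p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - self.p)                        # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1    # leading zeros in the remainder, plus 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:          # small-range (linear counting) correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Feeding every item into `add` and calling `count` at the end touches only the 2**p small registers, which is the memory saving being pointed at here, but as the follow-up comment notes it cannot answer "how many times did item A occur".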

ItemCounter99 -

To clarify, I still need precise counts per item. For example, item A appears 100 times, and 50 of those have property Z. I'm using the Bloom filter primarily to eliminate unlikely candidates from the dataset to keep it statistically relevant.
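
For the pre-filtering step described in this comment, something like the following Bloom filter sketch could sit in front of the exact counting passes: build it from the keys you actually care about, then drop any record whose key definitely isn't in that set. The sizing formulas and double-hashing construction are textbook defaults, not the poster's implementation, and the capacity and error-rate numbers are placeholders.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not in the set' vs. 'probably in the set'."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target false-positive rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.k = max(1, round((self.m / capacity) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from the two halves of one 128-bit digest.
        d = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Hypothetical usage: keep only records whose item might be a candidate of interest.
candidates = BloomFilter(capacity=1_000_000, error_rate=0.001)
# for key in likely_candidate_keys: candidates.add(key)
# keep = (rec for rec in records if rec_item(rec) in candidates)
```

A Bloom filter only produces false positives, never false negatives, so pre-filtering this way never drops a record for an item that really is in the candidate set; the small fraction of false positives just means a few extra keys get counted exactly and can be discarded afterwards.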
