I'm working on a project that requires counting occurrences of billions of items and their properties, but I'm running into performance problems because the dataset doesn't fit in RAM. My initial approach was to count as much as possible in memory and then upsert the partial results into SQLite, but that's painfully slow and would take months to finish. I switched to DuckDB hoping for better performance, but it was even slower. I've also tried pruning the data with a probability distribution and a bloom filter, but neither gave a meaningful speedup. I'm now considering partitioning the data so each chunk fits in RAM and making multiple passes, but I wanted to see if anyone has better suggestions or tips!
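For what it's worth, here is a rough sketch of the partition-then-count idea I'm considering. The file names, the tab-separated `item<TAB>property` layout, and the partition count are all placeholders, not my actual data:

```python
import os
import zlib
from collections import Counter

NUM_PARTITIONS = 256  # chosen so one partition's counts fit comfortably in RAM

def partition_by_item(input_path, spill_dir):
    """Pass 1: route every record to a spill file based on a stable hash of the item."""
    os.makedirs(spill_dir, exist_ok=True)
    spills = [open(os.path.join(spill_dir, f"part_{i:03d}.tsv"), "w")
              for i in range(NUM_PARTITIONS)]
    try:
        with open(input_path) as f:
            for line in f:
                item = line.split("\t", 1)[0]
                idx = zlib.crc32(item.encode("utf8")) % NUM_PARTITIONS
                spills[idx].write(line)
    finally:
        for s in spills:
            s.close()

def count_partition(path):
    """Pass 2: a partition holds all records for its items, so exact counting is safe."""
    item_counts = Counter()
    pair_counts = Counter()
    with open(path) as f:
        for line in f:
            item, prop = line.rstrip("\n").split("\t", 1)
            item_counts[item] += 1
            pair_counts[(item, prop)] += 1
    return item_counts, pair_counts

partition_by_item("records.tsv", "spill")
for name in sorted(os.listdir("spill")):
    item_counts, pair_counts = count_partition(os.path.join("spill", name))
    # ...write or upsert the exact counts for this partition here...
```

Because the hash is keyed on the item, every occurrence of a given item lands in the same spill file, so the per-partition counts are exact without ever holding the whole dataset in memory.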
1 Answer
You might want to look into the HyperLogLog algorithm if you can tolerate approximate results. It estimates the number of distinct items in a stream using only a few kilobytes of memory, and it's in the same family of probabilistic data structures as your bloom filter, so it fits the same memory-reduction idea. Just keep in mind that scanning all that data is going to be slow regardless.
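If approximate distinct counts are enough, something like this is a rough sketch using the third-party datasketch package; the file name and record layout are assumptions on my part:

```python
from datasketch import HyperLogLog  # third-party: pip install datasketch

# p=14 gives 2**14 registers, roughly 16 KB of state and about 0.8% relative error.
hll = HyperLogLog(p=14)

with open("records.tsv") as f:
    for line in f:
        item = line.split("\t", 1)[0]
        hll.update(item.encode("utf8"))

print(f"approximate number of distinct items: {hll.count():.0f}")
```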
To clarify, I still need precise counts per item. For example, item A appears 100 times, and 50 of those have property Z. I’m using the bloom filter primarily to eliminate unlikely candidates from the dataset to keep it statistically relevant.
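In case it helps, this is roughly the kind of prefilter I mean, sketched with only the standard library; the candidate list, sizes, and record format below are placeholders for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter over a bytearray. num_hashes must be <= 8 because the
    bit positions are carved out of a single 32-byte SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the items I actually care about (placeholder list),
# then drop records whose item definitely isn't a candidate. False positives
# let a few extras through, but no real candidate is ever discarded.
candidates = ["item_a", "item_b", "item_c"]
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
for c in candidates:
    bf.add(c)

def keep(line):
    return line.split("\t", 1)[0] in bf
```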