What’s the Best Database Structure for Fast Primary Key Lookups?

Asked By TechyTurtle42 On

Hey everyone!

I'm building an application that generates batches of strings using an algorithm. I need to check if these strings exist in a huge dataset that has over 2 billion rows, but loading all of that into memory isn't practical. I'm expecting to generate batches of 100 to 5000 strings and possibly process several million checks per hour.

Right now, I'm considering using AWS for this. My idea is to store the dataset in RDS and make the string itself the primary key, so an existence check is just a primary-key lookup. There are a couple of other columns I'll need info from later, but primarily I want to check whether the strings exist.

Would something like DynamoDB be a better choice? The app will run on ECS, and I'm a bit worried about I/O performance, since streaming the data from local storage is slow. It seems like AWS might have optimizations for containers with mounted storage. My main priorities are cost (RDS Aurora has those per-I/O fees) and performance. Any advice? Thanks!
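For context, here's roughly the kind of batched existence check I have in mind against RDS (Postgres via psycopg2). The endpoint, credentials, table, and column names below are placeholders, since nothing is actually set up yet:

```python
# Sketch of a batched primary-key existence check against RDS Postgres.
# Endpoint, credentials, table, and column names are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="strings_db",
    user="app",
    password="change-me",
)

def existing_strings(batch):
    """Return the subset of `batch` (100-5000 strings) already present in the table."""
    with conn.cursor() as cur:
        # ANY(%s) sends the whole batch in one round trip instead of one query per string.
        cur.execute(
            "SELECT str_value FROM strings WHERE str_value = ANY(%s)",
            (batch,),
        )
        return {row[0] for row in cur.fetchall()}

hits = existing_strings(["abc123", "def456", "ghi789"])
```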

5 Answers

Answered By HashItOut45 On

Just curious about the application you’re working on. If those strings are long, you might consider storing only their hashes instead, to save space and speed up lookups. Just a thought!
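Something like this is all I mean; it assumes SHA-256 and that you'd store and look up the digest rather than the raw string (names are purely illustrative):

```python
import hashlib

def lookup_key(s: str) -> bytes:
    # Fixed 32-byte key no matter how long the original string is.
    return hashlib.sha256(s.encode("utf-8")).digest()

# Store lookup_key(s) as the primary key when loading the dataset, then hash
# each generated string the same way at check time and look up the digest.
digest = lookup_key("some long generated string")
```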

Answered By StringCheckerX On

What kind of string checks are you planning? Exact matches only, or substrings too? And what latency do you need on those checks?

Answered By CleverCoder89 On

If some strings get checked repeatedly, ElastiCache might be beneficial. Check the cache first, and on a miss, query the database and write the result back into ElastiCache. It should help with the frequent lookups, although I doubt you could keep all 2 billion keys in memory.
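Here's a rough cache-aside sketch with redis-py against an ElastiCache Redis endpoint; the hostname and the `exists_in_db` callback are placeholders for whatever backing store you end up with:

```python
import redis

# Placeholder ElastiCache (Redis) endpoint.
cache = redis.Redis(host="my-cache.xxxxxx.use1.cache.amazonaws.com", port=6379)

def string_exists(s: str, exists_in_db) -> bool:
    """Cache-aside check: try ElastiCache first, fall back to the database."""
    cached = cache.get(s)
    if cached is not None:
        return cached == b"1"
    found = exists_in_db(s)                           # your RDS/DynamoDB lookup
    cache.set(s, b"1" if found else b"0", ex=3600)    # cache hits AND misses for an hour
    return found
```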

Answered By DataDynamo88 On

You might want to look into using Bloom filters to speed things up. Bloom filter indexes in Postgres can help you avoid having to scan the whole table to check for string existence—definitely a good strategy if you're dealing with such large data.
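If you try it, the index comes from the contrib `bloom` extension; a minimal sketch (connection string, table, and column names are just examples) would look something like:

```python
# Sketch: enable the Postgres `bloom` extension and index the string column.
import psycopg2

conn = psycopg2.connect("dbname=strings_db")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS bloom;")
    cur.execute(
        "CREATE INDEX IF NOT EXISTS strings_bloom_idx "
        "ON strings USING bloom (str_value);"
    )
```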

CheckingIn80s -

Great advice on Bloom filters! Just a heads up, though: the Postgres bloom index isn’t quite what you might expect, since it stores a per-row signature rather than acting as a single set-membership filter. Still worth checking out!

Answered By SpeedyQuery101 On

Slow local storage is definitely a concern. You could look at an embedded database like RocksDB with Bloom filters enabled for fast point lookups. Basically, you’d bulk-load all your rows once, compact the database, and park the resulting files in S3 so each application run can pull down a prebuilt copy instead of rebuilding anything. If you can live with a small false-positive rate, the whole pipeline gets a lot simpler and faster.
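A sketch of the lookup side using the `python-rocksdb` bindings, assuming the database files were built and compacted offline and synced down from S3 before the app starts (paths and tuning numbers are made up):

```python
# Sketch: local RocksDB keyed by the strings, with block-based Bloom filters so
# most negative lookups never touch disk. Values stay empty since only existence matters.
import rocksdb

opts = rocksdb.Options()
opts.create_if_missing = True
opts.table_factory = rocksdb.BlockBasedTableFactory(
    filter_policy=rocksdb.BloomFilterPolicy(10),    # ~10 bits per key
    block_cache=rocksdb.LRUCache(2 * (1024 ** 3)),  # 2 GB block cache
)

db = rocksdb.DB("/data/strings.db", opts)           # directory synced from S3

def exists(s: str) -> bool:
    return db.get(s.encode("utf-8")) is not None
```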

EfficientEngineer -

How do you ensure you don’t lose accuracy while using Bloom filters? Just curious!
