I'm curious about how to create a project similar to Repost Sleuth, which can search through millions of photos in just 1-2 seconds. My guess is that it encodes each image into a string to compare, but I wonder if that's fast enough. What algorithms or techniques could be used to make this work efficiently? Any insights would be appreciated!
5 Answers
Using checksums or similar methods could work well here. They enable quick comparisons and save resources!
Using a hash or checksum is smart. It shrinks each image down to a small 128-256 bit value, which reduces storage needs and enables fast lookups. You won't have to compare every image, just the hashes! For hashing, consider XXH3_128bits or BLAKE3, but keep in mind that they only catch byte-for-byte duplicates; even a slight alteration to an image produces a completely different hash.
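If exact-duplicate detection is all you need, here's a minimal sketch using Python's standard hashlib. BLAKE2b stands in for BLAKE3 because it ships with the standard library, and the file names are hypothetical:

```python
import hashlib

def file_digest(path: str) -> str:
    """Return a 128-bit BLAKE2b digest of a file's raw bytes."""
    h = hashlib.blake2b(digest_size=16)  # 16 bytes = 128 bits
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Exact-duplicate check: identical bytes give identical digests,
# but any resize, re-encode, or crop changes the digest completely.
seen: dict[str, str] = {}
for path in ["a.jpg", "b.jpg", "c.jpg"]:  # hypothetical file names
    digest = file_digest(path)
    if digest in seen:
        print(f"{path} is a byte-for-byte duplicate of {seen[digest]}")
    else:
        seen[digest] = path
```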
Optimize with hashing, indexing, and possibly a tree-based data structure so lookups never have to scan every stored hash. Those approaches will significantly speed things up. The Reddit Repost Sleuth project is open source on GitHub if you want to see a real implementation!
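To make the tree idea concrete: one structure commonly used to search hashes within a small Hamming distance is a BK-tree. This is just an illustrative sketch (Python 3.10+ for int.bit_count()), not how Repost Sleuth itself is implemented:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return (a ^ b).bit_count()

class BKTree:
    """BK-tree keyed on Hamming distance for near-duplicate hash search."""

    def __init__(self) -> None:
        self.root = None  # each node is (hash, {distance: child_node})

    def add(self, h: int) -> None:
        if self.root is None:
            self.root = (h, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (h, {})
                return

    def search(self, h: int, radius: int) -> list[int]:
        """Return all stored hashes within `radius` bits of `h`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = hamming(h, value)
            if d <= radius:
                results.append(value)
            # Triangle inequality: only subtrees whose edge distance is
            # within [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results
```

Each query only descends into subtrees that could contain a match, so for small radii it visits far fewer nodes than a full scan.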
Hash codes are definitely a good approach: once computed, comparisons don't need the original pixels at all, which keeps them efficient. Perceptual hashes in particular can still match when users haven't significantly altered the images.
You're on the right track with the encoding idea. Most likely, each photo is hashed into a large number that gets stored. Instead of re-processing the whole collection for every new photo, it computes a single hash for the new one and looks it up in a database of stored hashes. With proper indexing, those lookups can be lightning fast!
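As a minimal sketch of that store-and-look-up flow, here's an SQLite version. The table and column names are just illustrative, and this only handles exact hash matches; near-duplicates need a Hamming-distance search like the BK-tree above:

```python
import sqlite3

conn = sqlite3.connect("hashes.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (hash TEXT, post_id TEXT)")
# An index on the hash column turns each lookup into a B-tree search
# instead of a full table scan.
conn.execute("CREATE INDEX IF NOT EXISTS idx_hash ON images (hash)")

def check_and_store(img_hash: str, post_id: str) -> list[str]:
    """Return prior posts with the same hash, then record this one."""
    matches = [row[0] for row in conn.execute(
        "SELECT post_id FROM images WHERE hash = ?", (img_hash,))]
    conn.execute("INSERT INTO images VALUES (?, ?)", (img_hash, post_id))
    conn.commit()
    return matches
```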
Actually, those general-purpose hash functions might not be ideal for images: change a single pixel and the hash comes out completely different. A specialized perceptual image hash is more effective for spotting near-duplicates; there are some great ones in the imagehash library on GitHub.
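For example, with the imagehash library (the file names and the distance threshold below are assumptions, not values from Repost Sleuth):

```python
# pip install imagehash pillow
from PIL import Image
import imagehash

# Perceptual hash: visually similar images produce similar hashes,
# unlike checksum-style hashes where one pixel changes everything.
h1 = imagehash.phash(Image.open("original.jpg"))   # hypothetical files
h2 = imagehash.phash(Image.open("reposted.jpg"))

# Subtracting two image hashes gives their Hamming distance in bits.
distance = h1 - h2
print(f"Hamming distance: {distance}")
if distance <= 8:  # threshold is a tunable assumption
    print("Likely the same image, possibly resized or re-encoded.")
```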