Good morning everyone! I'm looking to design a searchable database that contains 200,000 PDF books, all stored on Verbatim 128 GB optical discs. I'm curious about which software tools or programs I should use to manage and query this database before I start burning the discs. Additionally, what kind of data structure and search architecture would be best for easy offline retrieval? My main goal is to ensure that within the next 20 years, I can access and search the entire archive locally on a standard PC with a disc reader, without needing any internet connectivity.
4 Answers
I recommend following the 3-2-1 backup rule: keep three copies of your data on two different types of storage media, with one copy stored off-site. While it might seem tempting to create your own indexing software, there are already many tried-and-true archival systems out there. Maybe consider tape backups along with these discs!
You might want to check out Paperless for indexing your PDFs. It's typically used for organizing paperless offices, but it should work for your purpose too, although it might take a while to set everything up.
It sounds like a hefty project, but the good news is that there are open-source solutions that can help! For instance, Lucene and Recoll are great tools for building a searchable index for your PDFs. Lucene is a search library, so it fits if you want to build a customizable frontend, while Recoll comes with a GUI and can handle multiple data sources. You can even create a scripted solution that builds a searchable index, splits your collection across the discs, and updates the index as needed. Just plan the disc and directory layout up front so everything stays manageable!
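To make the "splits your collection across the discs" part concrete, here is a rough Python sketch using first-fit decreasing packing. It is only one possible approach, not something Lucene or Recoll does for you. The names `Disc`, `pack_files`, `DISC_CAPACITY` and `RESERVED` are made up for illustration, and the capacity and headroom figures are assumptions you should check against your actual media and filesystem.

```python
from dataclasses import dataclass, field

# Assumed figures: nominal 128 GB per disc, minus headroom for filesystem
# overhead plus the per-disc index and metadata files.
DISC_CAPACITY = 128 * 10**9
RESERVED = 2 * 10**9


@dataclass
class Disc:
    number: int
    free: int
    files: list = field(default_factory=list)


def pack_files(sizes, capacity=DISC_CAPACITY - RESERVED):
    """Assign files to discs with first-fit decreasing.

    sizes maps a file path to its size in bytes. Returns a list of Disc.
    """
    discs = []
    # Largest files first, so big items claim space before small ones fill gaps.
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{path} is larger than one disc")
        for disc in discs:
            if disc.free >= size:
                break  # first disc with enough room
        else:
            disc = Disc(number=len(discs) + 1, free=capacity)
            discs.append(disc)
        disc.files.append(path)
        disc.free -= size
    return discs
```

In practice you would fill `sizes` from `os.path.getsize` over the collection. If you would rather keep related books on the same disc, pack whole folders instead of single files.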
Those tools definitely make things easier. A script could also help automate the process of cataloging as you burn the discs!
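As a rough idea of what such a cataloging script could look like, here is a minimal sketch that records which disc each book lives on in a SQLite file, using only Python's standard library. The table layout and the function names (`open_catalog`, `add_book`, `find_books`) are assumptions for illustration. It searches titles and authors only, not the PDF text itself, so it complements a full-text tool like Recoll rather than replacing it.

```python
import sqlite3


def open_catalog(db_path):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS books (
               disc_id TEXT NOT NULL,
               path    TEXT NOT NULL,
               title   TEXT NOT NULL,
               author  TEXT,
               PRIMARY KEY (disc_id, path)
           )"""
    )
    return conn


def add_book(conn, title, author, disc_id, path):
    """Record one book; re-adding the same disc and path overwrites the row."""
    conn.execute(
        "INSERT OR REPLACE INTO books (disc_id, path, title, author) "
        "VALUES (?, ?, ?, ?)",
        (disc_id, path, title, author),
    )
    conn.commit()


def find_books(conn, term):
    """Substring match on title or author; tells you which disc to insert."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT title, author, disc_id, path FROM books "
        "WHERE title LIKE ? OR author LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return rows.fetchall()
```

The catalog file itself can stay on your hard drive, with a copy burned to every disc, so any single disc is enough to locate a book.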
Before you dive in, consider the longevity of those discs. Depending on how they're made and stored, optical discs can vary widely in lifespan. It's crucial to have a solid directory structure with meaningful filenames on each disc. Also include an HTML contents page so you can browse the disc without needing any special software. For indexing and searching, tools like Elasticsearch or Solr could be beneficial, but both run as server software, so think about how you will keep them installable and running as tech evolves!
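The HTML contents page is easy to generate before burning. Here is a minimal sketch, assuming the disc contents are staged in a folder first. The function name `build_contents_page` and the `index.html` filename are just illustrative choices.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_contents_page(disc_root, disc_id):
    """Write index.html at the disc root, linking to every PDF below it.

    Returns the number of PDFs listed.
    """
    root = Path(disc_root)
    # Relative POSIX-style paths keep the links valid wherever the disc mounts.
    # Note: "*.pdf" is matched case-sensitively on most non-Windows systems.
    pdfs = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.pdf"))
    items = "\n".join(
        f'<li><a href="{quote(rel)}">{html.escape(rel)}</a></li>' for rel in pdfs
    )
    title = html.escape(f"Contents of {disc_id}")
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )
    (root / "index.html").write_text(page, encoding="utf-8")
    return len(pdfs)
```

Plain HTML with relative links should open in any browser straight from the disc, which is the point: no database or search server has to survive 20 years just to see what is on it.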
Right, and don't forget to include a unique ID and metadata file at the root of each disc for better indexing later!
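For the ID and metadata file, something like the sketch below would do. It writes a `manifest.json` at the root of the staged disc folder with the disc ID plus the size and SHA-256 checksum of every file. The filename, the JSON fields and the function names are my own assumptions, not any standard. The checksums are an extra that lets you verify the disc later.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(disc_root, disc_id):
    """Write manifest.json at the disc root and return its contents as a dict."""
    root = Path(disc_root)
    manifest_path = root / "manifest.json"
    entries = []
    for p in sorted(root.rglob("*")):
        # Skip directories and the manifest itself (it may exist from a rerun).
        if p.is_file() and p != manifest_path:
            entries.append(
                {
                    "path": p.relative_to(root).as_posix(),
                    "bytes": p.stat().st_size,
                    "sha256": sha256_of(p),
                }
            )
    manifest = {"disc_id": disc_id, "file_count": len(entries), "files": entries}
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Run it as the last step before burning, so it covers everything else on the disc. Rehashing the files every few years and comparing against the manifest is a cheap way to catch a disc that is starting to degrade.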

I think you're right about the focus: the search index itself isn't the toughest part. The real challenge is storing the collection efficiently offline across multiple discs, which might favor older, established archiving systems.