Advice Needed for Enhancing Document Search in a Large PDF Database

Asked By CuriousTraveler123 On

Hello! I'm currently interning at a large international organization and have been tasked with improving the search functionality of an extensive PDF database, which is a bit daunting given my limited experience with AI and machine learning. There are about 27,000 PDFs, some dating back to the 1970s. They are stored in SharePoint, and currently the only way to search them is through metadata filters such as language and origin.

My goal is to make the search not only more accessible but also sustainable after I leave. I'm thinking about starting small, with metadata and keyword searches, and, if time allows, implementing more complex contextual searches.

I'm contemplating mirroring the SharePoint data in a PostgreSQL database to manage the documents and their metadata, but I'm unsure if this approach is necessary or if there are simpler, cost-effective options that would work just as well. Additionally, I'm curious if it's feasible to set all this up within my 6-month internship given my current skill level. Any advice would really help!

4 Answers

Answered By DataWhisperer99 On

SharePoint itself supports metadata, so you might not need a PostgreSQL database right away. Focus on getting OCR done for your documents first. If you later decide to sync with an external database, keep it to the metadata you actually need, with each record linking back to the file in SharePoint.
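To make "OCR first" concrete: many PDFs from the 1970s are likely scans with no text layer, while newer ones may already contain selectable text. A rough sketch of how to tell them apart, assuming you already have the per-page text from whatever PDF extractor you pick (extraction itself is not shown). The function name and the 50-character threshold are my own illustrative assumptions, so tune them on a sample of your files.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Return True if a document looks scanned (little or no text layer).

    page_texts: list of strings, one per page, from any PDF text
    extractor (hypothetical input; the extraction step is not shown).
    min_chars_per_page: assumed threshold, not a standard value.
    """
    if not page_texts:
        return True  # nothing could be extracted, so treat it as a scan
    # Count only non-whitespace characters so empty layout noise is ignored.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

Running this over the whole collection once gives you a list of files to send through OCR, so you don't spend time re-processing documents that already have usable text.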

Answered By SearchGuru88 On

For this kind of project, Elasticsearch could be a great fit! It handles large volumes of documents well and offers powerful full-text search capabilities.
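To give a feel for it, here is a sketch of a query body in Elasticsearch's query DSL that combines a full-text match with the metadata filters you already have. The field names (`content`, `language`, `origin`) are assumptions about how you might map your documents, and the `term` filters assume those two fields are indexed as exact-match keyword fields. The snippet only builds the request body; sending it to a cluster is left out.

```python
import json

def build_search_body(keywords, language=None, origin=None, size=10):
    """Build a query body: full-text match plus optional metadata filters."""
    filters = []
    if language:
        filters.append({"term": {"language": language}})  # assumed field name
    if origin:
        filters.append({"term": {"origin": origin}})      # assumed field name
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to relevance scoring
                "must": [{"match": {"content": keywords}}],
                # "filter" clauses only narrow the result set
                "filter": filters,
            }
        },
    }

body = build_search_body("water treaty", language="fr")
print(json.dumps(body, indent=2))
```

The nice part is that your existing language and origin filters keep working exactly as before, with keyword search layered on top.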

Answered By TechSage42 On

A good starting point would be to extract text and use OCR where needed, then build a basic keyword search. Later on, you can implement embeddings for semantic search. This approach avoids duplicating your entire document database and keeps maintenance easier long-term.
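If it helps to see what "basic keyword search" means under the hood, here is a toy inverted index in plain Python. It is only a sketch to show the idea: the file names and texts are made up, and for 27,000 documents you would more likely use a database or search engine that does this for you.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """docs: dict mapping a document id to its extracted text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)  # word -> set of documents containing it
    return index

def search(index, query):
    """Return ids of documents that contain every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Made-up sample data for illustration only.
docs = {
    "report_1978.pdf": "Annual report on regional water management",
    "memo_1992.pdf": "Memo about water treaty negotiations",
}
index = build_index(docs)
print(search(index, "water treaty"))  # {'memo_1992.pdf'}
```

Semantic search with embeddings can be added later on top of the same extracted text, so none of this work is wasted.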

Answered By FutureInnovation77 On

Your approach is valid, but be cautious about fully replicating SharePoint into another database, as keeping the two in sync gets complicated. Instead, extract the text once and store it somewhere convenient to query, such as Postgres or a search index, with each record linking back to the original PDF in SharePoint. Starting with OCR and keyword search will still be a massive improvement!
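A minimal sketch of the "extract once, link back" idea. I'm using SQLite from the Python standard library here purely as a stand-in so the example runs anywhere; the same table layout would carry over to Postgres. The table and column names are my own assumptions, and `extract` is a hypothetical callable standing in for whatever text extraction or OCR step you end up with.

```python
import hashlib
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the text store. SQLite is a stand-in for Postgres."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               sharepoint_url TEXT PRIMARY KEY,  -- link back to the original PDF
               content_hash   TEXT NOT NULL,     -- used to detect changed files
               body           TEXT NOT NULL      -- extracted or OCR text
           )"""
    )
    return conn

def store_text(conn, sharepoint_url, pdf_bytes, extract):
    """Extract and store text only if the file is new or has changed.

    extract: hypothetical callable that takes PDF bytes and returns text.
    Returns True if extraction ran, False if the stored copy was reused.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE sharepoint_url = ?",
        (sharepoint_url,),
    ).fetchone()
    if row and row[0] == digest:
        return False  # unchanged since last run, skip the expensive step
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
        (sharepoint_url, digest, extract(pdf_bytes)),
    )
    conn.commit()
    return True
```

Because only the text and a URL are stored, SharePoint stays the single source of truth for the PDFs themselves, and a re-run only touches files whose contents have changed.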
