Hello! I'm currently interning at a large international organization and I've been tasked with improving the search functionality of an extensive PDF database, which is a bit daunting given my limited experience with AI and machine learning. There are about 27,000 PDFs, some dating back to the 1970s. These documents are stored in SharePoint and currently, the only way to search them is through metadata filters like language and origin.
My goal is to make the search not only more accessible but also sustainable after I leave. I'm thinking about starting small, with metadata and keyword searches, and, if time allows, implementing more complex contextual searches.
I'm contemplating mirroring the SharePoint data in a PostgreSQL database to manage the documents and their metadata, but I'm unsure if this approach is necessary or if there are simpler, cost-effective options that would work just as well. Additionally, I'm curious if it's feasible to set all this up within my 6-month internship given my current skill level. Any advice would really help!
4 Answers
SharePoint already stores metadata, so you may not need a PostgreSQL database right away. Focus on OCR for your documents first. If you later decide to sync with an external database, copy only the metadata you actually need and keep a link back to each document in SharePoint.
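To decide which files need OCR at all, one rough heuristic is to check how much text a normal PDF text extractor returns per page. The sketch below assumes you already have a list of per-page strings from whatever extractor you choose; `needs_ocr`, the 50-character threshold, and the "more than half the pages" rule are illustrative assumptions to tune on a sample, not established values.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is a scan that needs OCR.

    page_texts: list of strings, one per page, as returned by a
    PDF text extractor of your choice (assumed, not shown here).
    min_chars_per_page: hypothetical threshold below which a page
    counts as having no usable text layer.
    """
    if not page_texts:
        # Nothing could be extracted at all, so treat it as a scan.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Flag the document when more than half of its pages are nearly empty.
    return sparse_pages / len(page_texts) > 0.5
```

Running this over all 27,000 files first should tell you how large the OCR job really is before you commit to a tool.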
For this kind of project, Elasticsearch could be a great fit! It handles large volumes of documents well and offers powerful full-text search capabilities.
A good starting point would be to extract text and use OCR where needed, then build a basic keyword search. Later on, you can implement embeddings for semantic search. This approach avoids duplicating your entire document database and keeps maintenance easier long-term.
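To make "basic keyword search" concrete, here is a minimal sketch of an inverted index using only the Python standard library. It assumes the text has already been extracted into a dict keyed by a document identifier; the function names and the AND-only matching are my own illustrative choices, and a real search engine would add ranking, stemming, and language handling on top.

```python
import re
from collections import defaultdict

# \w+ matches Unicode word characters, so non-English text is tokenized too.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it.

    docs: dict of {doc_id: extracted_text} (hypothetical input shape).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so intersections never mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

For example, after `index = build_index(docs)`, calling `search(index, "water policy")` returns only the documents that contain both words. Once this works on a sample, moving the same idea into a proper search index is mostly a configuration exercise.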
Your approach is valid. Just be cautious about fully replicating SharePoint into another database, because keeping two copies in sync gets complicated. Instead, extract the text once and store it somewhere easy to query, such as Postgres or a search index, with each record linking back to the original PDF in SharePoint. Starting with OCR and keyword search will still be a massive improvement!
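A minimal sketch of the "extract once, link back" idea is below. It uses SQLite from the standard library purely for illustration (the same table shape would work in Postgres), and the table name, column names, and the `extract` callback are all assumptions of mine. A content hash lets you skip files that have not changed since the last run.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the text store. The schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            sharepoint_url TEXT PRIMARY KEY,  -- link back to the original PDF
            content_hash   TEXT NOT NULL,     -- SHA-256 of the PDF bytes
            text           TEXT NOT NULL      -- extracted or OCR'd text
        )
        """
    )
    return conn


def upsert_text(conn, sharepoint_url, pdf_bytes, extract):
    """Store extracted text unless this exact file was already processed.

    extract: any callable that turns PDF bytes into text (hypothetical;
    plug in your own text extraction or OCR step).
    Returns True if extraction ran, False if the file was unchanged.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE sharepoint_url = ?",
        (sharepoint_url,),
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # Same bytes as last time, so skip the expensive step.
    text = extract(pdf_bytes)
    conn.execute(
        "INSERT OR REPLACE INTO documents (sharepoint_url, content_hash, text) "
        "VALUES (?, ?, ?)",
        (sharepoint_url, digest, text),
    )
    conn.commit()
    return True
```

Because only the URL, a hash, and the text are stored, SharePoint stays the single source of truth for the files themselves, and re-running the job is cheap.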
