Advice Needed for Enhancing Document Search in a Large PDF Database

March 8, 2026

Asked By CuriousTraveler123 On March 8, 2026

Hello! I'm currently interning at a large international organization and I've been tasked with improving the search functionality of an extensive PDF database, which is a bit daunting given my limited experience with AI and machine learning. There are about 27,000 PDFs, some dating back to the 1970s. The documents are stored in SharePoint, and currently the only way to search them is through metadata filters like language and origin.

My goal is to make the search not only more accessible but also sustainable after I leave. I'm thinking about starting small, with metadata and keyword searches, and, if time allows, implementing more complex contextual searches.

I'm contemplating mirroring the SharePoint data in a PostgreSQL database to manage the documents and their metadata, but I'm unsure if this approach is necessary or if there are simpler, cost-effective options that would work just as well. Additionally, I'm curious if it's feasible to set all this up within my 6-month internship given my current skill level. Any advice would really help!

4 Answers

Answered By DataWhisperer99 On March 11, 2026

SharePoint itself supports metadata, so you might not need a PostgreSQL database right away. Focus on OCR for your documents first, and if you decide on syncing with an external database, ensure it’s just the necessary metadata linked back to SharePoint.

Answered By SearchGuru88 On March 11, 2026

For this kind of project, Elasticsearch could be a great fit! It handles large volumes of documents well and offers powerful search capabilities.

Answered By TechSage42 On March 9, 2026

A good starting point would be to extract text and use OCR where needed, then build a basic keyword search. Later on, you can implement embeddings for semantic search. This approach avoids duplicating your entire document database and keeps maintenance easier long-term.
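As a rough illustration of those first two steps, here is a minimal sketch using only the Python standard library. It assumes the text of each page has already been pulled out by whatever PDF library you choose, so no extraction or OCR call is shown. The names `pages_needing_ocr`, `build_index`, `search` and the 20-word threshold are hypothetical, picked only for this example.

```python
import re
from collections import Counter

# Hypothetical threshold: a page with fewer word-like tokens than this is
# treated as a scanned image that still needs OCR. Tune it on real files.
MIN_WORDS_PER_PAGE = 20


def tokenize(text):
    """Lowercase the text and split it into ASCII alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pages_needing_ocr(page_texts, min_words=MIN_WORDS_PER_PAGE):
    """Return the indexes of pages whose embedded text layer looks empty."""
    return [i for i, text in enumerate(page_texts)
            if len(tokenize(text)) < min_words]


def build_index(docs):
    """Build an inverted index: token -> Counter of {doc_id: occurrences}."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, Counter())[doc_id] += 1
    return index


def search(index, query):
    """Return doc_ids containing every query term, most occurrences first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, Counter()) for term in terms]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)

    def score(doc_id):
        return sum(posting[doc_id] for posting in postings)

    # Ties are broken alphabetically so results are deterministic.
    return sorted(matches, key=lambda doc_id: (-score(doc_id), doc_id))
```

For example, with `docs = {"a.pdf": "Annual water report 1978. Water quality in the region.", "b.pdf": "Budget report 1994"}`, calling `search(build_index(docs), "water report")` returns `["a.pdf"]`. The ASCII-only tokenizer is a simplification; a multilingual collection would need something more careful.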

Answered By FutureInnovation77 On March 9, 2026

Your approach is valid, but be cautious about fully replicating SharePoint into another database, as that can complicate synchronization. Instead, extract the text once and store it somewhere convenient to query, such as Postgres or a search index, with each record linking back to the original PDF in SharePoint. Starting with OCR and keyword search will still be a massive improvement!
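To make the "extract once, link back" idea concrete, here is a small sketch that uses SQLite from the Python standard library as a stand-in for Postgres or a search index. The table layout, the function names, and the `extract_text` callback are all assumptions for illustration; the callback represents whatever extraction or OCR step you end up using. A content hash lets a re-run skip PDFs that have not changed, so the originals stay in SharePoint and only the derived text lives here.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               sharepoint_url TEXT PRIMARY KEY,  -- link back to the original
               content_hash   TEXT NOT NULL,     -- detects changed files
               language       TEXT,
               body           TEXT NOT NULL      -- extracted or OCR'd text
           )"""
    )
    return conn


def store_if_changed(conn, sharepoint_url, pdf_bytes, language, extract_text):
    """Extract and store text only when the PDF is new or its bytes changed.

    Returns True if extraction ran, False if the stored copy was current.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE sharepoint_url = ?",
        (sharepoint_url,),
    ).fetchone()
    if row is not None and row[0] == digest:
        return False
    body = extract_text(pdf_bytes)  # caller-supplied; may be slow (OCR)
    conn.execute(
        "INSERT OR REPLACE INTO documents "
        "(sharepoint_url, content_hash, language, body) VALUES (?, ?, ?, ?)",
        (sharepoint_url, digest, language, body),
    )
    conn.commit()
    return True


def find(conn, keyword):
    """Return SharePoint URLs whose text contains the keyword.

    Uses a plain substring match; '%' and '_' in the keyword act as
    wildcards. SQLite's LIKE ignores case for ASCII letters by default.
    """
    rows = conn.execute(
        "SELECT sharepoint_url FROM documents WHERE body LIKE ? "
        "ORDER BY sharepoint_url",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]
```

The substring match is only a placeholder. If you go with Postgres, its built-in full-text search would be the more natural choice than `LIKE`, and the same "URL plus hash plus text" layout would carry over.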
