Finding Accurate OCR Solutions for Scanned PDF Text Extraction

Asked By CreativeCactus29 On

I'm working on a project that parses the content of PDF orders and returns the results to users. I've hit a few roadblocks. It works well for known templates and even some unseen ones, but my biggest challenge comes with non-selectable text or scanned PDF orders that need OCR for text extraction. I tried using OCRmyPDF with Tesseract, but it often misses lines and jumbles the quantities. I've also given PaddleOCR a shot, but it gets stuck in a loop and never finishes processing. What can I use to achieve accurate OCR extraction?

5 Answers

Answered By SolverSage42 On

Building a classifier could be a good idea if you can train it right, but it might be overkill with 1600 different templates to handle.

Answered By PrecisionPal31 On

You might want to try TrOCR, available on Hugging Face. It's a Microsoft model that has given me good results, especially with structured data like tables in a welding environment. It’s not perfect, but I think it could be quite effective for your needs. Plus, you can fine-tune it on some known orders for even better accuracy.
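For reference, a minimal sketch of running TrOCR through the Hugging Face `transformers` library. It assumes the `microsoft/trocr-base-printed` checkpoint and Pillow images as input. TrOCR recognizes one line of text per image, so a scanned page has to be cut into line crops first; the bounding boxes are assumed to come from a separate line detector that is not shown. The helper names `crop_lines`, `load_trocr`, and `read_line` are made up for this example.

```python
def crop_lines(page, boxes):
    """Cut a page image into line images.

    `page` is expected to be a Pillow image. `boxes` is a list of
    (left, top, right, bottom) pixel tuples from some separate line
    detector (an assumption of this sketch, not part of TrOCR).
    """
    # Sort by the top edge so lines come back in reading order.
    return [page.crop(box) for box in sorted(boxes, key=lambda b: b[1])]


def load_trocr(checkpoint="microsoft/trocr-base-printed"):
    """Load the processor and model for the given checkpoint."""
    # Imported lazily: transformers and torch are heavy, and the first
    # call downloads the checkpoint unless it is already cached.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    return processor, model


def read_line(line_image, processor, model):
    """Recognize the text in one cropped line image."""
    inputs = processor(images=line_image.convert("RGB"), return_tensors="pt")
    generated_ids = model.generate(inputs.pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Typical use would be to load the model once with `load_trocr()`, then call `read_line` on every crop returned by `crop_lines`. Keeping each order row as its own crop may help with the jumbled-quantities problem, since the text never has to be re-ordered across lines.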

Answered By SmartScanner23 On

I usually rely on the Mistral OCR endpoint. It's not perfect, but it has decent accuracy. Just a heads-up, this might not work if you want everything kept local.

CuriousCoder88 -

I've tried Mistral too, using it through OpenRouter, and it worked pretty well when combined with the AI reviewer. Is there a way to use Mistral independently from OpenRouter, maybe without the AI reviewer as a fallback?
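Mistral also offers OCR on its own API, so OpenRouter should not be required. The sketch below uses only the Python standard library and rests on several assumptions that should be checked against Mistral's current documentation: the `https://api.mistral.ai/v1/ocr` endpoint, the `mistral-ocr-latest` model name, a local PDF passed as a base64 data URI, an API key stored in `MISTRAL_API_KEY`, and a response containing a `pages` list with a `markdown` field per page. The function names are made up for this example.

```python
import base64
import json
import os
import urllib.request

# Assumed endpoint; verify against the current Mistral API reference.
MISTRAL_OCR_URL = "https://api.mistral.ai/v1/ocr"


def build_ocr_payload(pdf_bytes, model="mistral-ocr-latest"):
    """Wrap a local PDF as a base64 data URI inside the request body."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "model": model,
        "document": {
            "type": "document_url",
            "document_url": f"data:application/pdf;base64,{encoded}",
        },
    }


def ocr_pdf(pdf_bytes, api_key=None):
    """Send the PDF to the OCR endpoint and return one Markdown string per page."""
    api_key = api_key or os.environ["MISTRAL_API_KEY"]
    request = urllib.request.Request(
        MISTRAL_OCR_URL,
        data=json.dumps(build_ocr_payload(pdf_bytes)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumed response shape: {"pages": [{"markdown": "..."}, ...]}
    return [page["markdown"] for page in body["pages"]]
```

Because this calls the OCR endpoint directly, no reviewer model is involved; any second-pass check on the extracted quantities would have to be added separately.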

Answered By TechieTurtle91 On

Make sure to download the OCR models beforehand. If you don't, your server will automatically try to download around 1.1GB the first time it processes a document. This can be a hassle, especially if you're using Docker, as it will happen every time the container restarts.
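One way to catch this early is a startup check that refuses to run when the model files are missing, so a container never starts a silent download in the middle of a request. The sketch below is generic and not tied to any particular OCR library: the `OCR_MODEL_DIR` variable, the `/models` default, and both function names are hypothetical. The directory would be filled at image build time or mounted as a volume, using whatever download command the chosen library provides.

```python
import os
from pathlib import Path


def model_cache_ready(cache_dir):
    """Return True if the directory exists and contains at least one file."""
    path = Path(cache_dir)
    return path.is_dir() and any(p.is_file() for p in path.rglob("*"))


def require_models(cache_dir=None):
    """Fail fast at startup instead of downloading on the first document."""
    # OCR_MODEL_DIR and /models are placeholder names for this sketch.
    cache_dir = cache_dir or os.environ.get("OCR_MODEL_DIR", "/models")
    if not model_cache_ready(cache_dir):
        raise RuntimeError(
            f"No OCR model files found in {cache_dir}; "
            "download them during the image build or mount them as a volume."
        )
    return Path(cache_dir)
```

Calling `require_models()` once when the server boots turns a slow first request into an immediate, visible error.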

Answered By DataDynamo77 On

Have you checked out Docling? It's a bit more powerful than what you might need, but it should give you great results for your OCR tasks.
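A minimal sketch of what that could look like, assuming Docling's `DocumentConverter` with its default settings and its Markdown export. Whether the defaults handle a given scanned order well is something to test on real files. The `table_rows` helper is not part of Docling; it is a hypothetical post-processing step that pulls table rows (for example item and quantity columns) out of the exported Markdown.

```python
def convert_to_markdown(pdf_path):
    """Convert a PDF with Docling and return its Markdown export."""
    # Imported lazily: Docling is a large dependency and may fetch
    # model files the first time it runs.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def table_rows(markdown):
    """Split Markdown table lines into lists of cell strings."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Only lines fully wrapped in pipes are treated as table rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |------|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

For an order, `table_rows(convert_to_markdown("order.pdf"))` would give a list of rows that can then be matched against the known templates; `order.pdf` is a placeholder path.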
