I'm working on a project that parses the content of PDF orders and returns the results to users. I've hit a few roadblocks. It works well for known templates and even some unseen ones, but my biggest challenge comes with non-selectable text or scanned PDF orders that need OCR for text extraction. I tried using OCRmyPDF with Tesseract, but it often misses lines and jumbles the quantities. I've also given PaddleOCR a shot, but it gets stuck in a loop and never finishes processing. What can I use to achieve accurate OCR extraction?
5 Answers
Building a classifier could work if you can train it properly, but with 1,600 different templates to handle it might be overkill.
You might want to try TrOCR, available on Hugging Face. It's a Microsoft model that has given me good results, especially with structured data like tables in a welding environment. It's not perfect, but I think it could be quite effective for your needs. Plus, you can fine-tune it on some known orders for even better accuracy.
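For reference, a minimal sketch of running TrOCR through the `transformers` library. The checkpoint name `microsoft/trocr-base-printed` is one published variant (there are also handwritten and fine-tuned ones); the file path is a placeholder.

```python
# Hedged sketch: OCR a single cropped text-line image with Microsoft's TrOCR
# via Hugging Face transformers. Not the poster's exact setup.

def ocr_line(image_path: str) -> str:
    # Heavy dependencies imported lazily so they are only required on call.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

One caveat: TrOCR operates on single text lines, so for full order pages you'd pair it with a layout/line-detection step that crops each line first.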
I usually rely on the Mistral OCR endpoint. It's not perfect, but it has decent accuracy. Just a heads-up, this might not work if you want everything kept local.
Make sure to download the OCR models beforehand. If you don't, your server will automatically try to download around 1.1 GB the first time it processes a document. This can be a hassle, especially with Docker, since the download repeats every time the container is recreated unless the cache is baked into the image or persisted in a volume.
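One way to handle this is to pull the models at image build time. A rough Dockerfile sketch, assuming a Hugging Face-style cache and using TrOCR as the example model; adapt the warm-up line to whichever OCR stack you actually run, or mount a persistent volume at the cache path instead.

```dockerfile
# Hedged sketch: bake the ~1.1 GB model download into the image build
# so it happens once, not on every container recreation.
FROM python:3.11-slim
ENV HF_HOME=/opt/models
RUN pip install --no-cache-dir transformers pillow
# Hypothetical warm-up step: fetching the model at build time populates /opt/models.
RUN python -c "from transformers import TrOCRProcessor; \
    TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')"
```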
Have you checked out Docling? It's a bit more powerful than what you might need, but it should give you great results for your OCR tasks.
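Docling's high-level API is quite compact; a minimal sketch (the PDF path is a placeholder):

```python
# Hedged sketch: convert a PDF with Docling's DocumentConverter, which runs
# layout analysis plus OCR on scanned pages, and export structured markdown.

def convert_order(pdf_path: str) -> str:
    from docling.document_converter import DocumentConverter  # lazy import

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()
```

The markdown output preserves table structure, which tends to matter for order quantities.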

I've tried Mistral too, using it through OpenRouter, and it worked pretty well when combined with the AI reviewer. Is there a way to use Mistral independently from OpenRouter, maybe without the AI reviewer as a fallback?