I'm setting up OCR workflows on AWS for academic PDFs, a mix of scanned documents and files with complex layouts such as multi-column text and footnotes. My goal is to convert these PDFs into clean Markdown for further processing. I started locally with Tesseract via Docker, and I've also tried OCRFlux for its ability to handle cross-page tables and multilingual content.

I've experimented with three approaches: (1) EC2 with CUDA support for OCRFlux, which is manageable for batch jobs but feels wasteful; (2) Lambda with Tesseract in custom layers, which works for simpler PDFs but struggles with larger documents because of Lambda's memory and timeout limits; and (3) EKS with GPU nodes, which is scalable but complicated and costly depending on the number of nodes.

I'm looking for advice on the best trade-off between cost and orchestration for processing 500-1000 PDFs a month. Has anyone used Batch or Fargate for such workloads? Also, has anyone offloaded OCR to tools like Textract or Comprehend, despite their layout limitations? I'd love to hear any experiences with modern parsers like OCRFlux and cloud deployment specifics.
5 Answers
Have you considered using Amazon Textract? It might simplify your OCR process given it’s designed for reading various document types. It can handle different formats and layouts, but I get that you might need finer control for certain layouts. Just curious, what stopped you from trying it first?
Open-source libraries could be a good fit here as well, especially ones that don’t require GPUs. Textractor can extract text, but remember, the transition to Markdown has to happen via a separate script. This isn't super easy but can be managed with the right library. Just a thought!
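To give a rough idea of what that separate script could look like, here is a minimal sketch that turns a Textract-style response into plain Markdown. It does not use the Textractor API; it only assumes the raw response shape (a `Blocks` list whose `LINE` entries carry `Text`, `Page`, and `Geometry.BoundingBox`). The function name `textract_to_markdown` and the "## Page N" heading convention are my own choices for illustration.

```python
from collections import defaultdict


def textract_to_markdown(response: dict) -> str:
    """Convert the LINE blocks of a Textract-style response into simple Markdown.

    Assumption: each LINE block has "Text" and "Geometry.BoundingBox" with
    "Top" and "Left"; a missing "Page" is treated as page 1.
    """
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        # Skip PAGE, WORD, TABLE, etc.; only whole lines are used here.
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))

    parts = []
    for page_no in sorted(pages):
        parts.append(f"## Page {page_no}")
        # Naive reading order: top to bottom, then left to right.
        lines = [text for _, _, text in sorted(pages[page_no])]
        parts.append("\n".join(lines))
    return "\n\n".join(parts) + "\n"
```

Note that sorting by top and then left is a naive reading order: on multi-column pages it will interleave the columns, so you would need to group lines by column (for example by clustering on `Left`) before sorting. Footnotes and tables would need their own handling too.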
Have you tested Mistral OCR? It's another option to consider. It might not support every layout you have, but it's worth exploring if you're dealing with up to 1000 PDFs per month.
Have you looked at Amazon Rekognition? It’s another tool that could handle OCR tasks. Just a suggestion if you're exploring options.
If you're not heavily reliant on GPU power, check out trigger.dev. They offer machines with 16 GB RAM that support queuing processes, which might be great for single-document OCR without needing a GPU. We use it for our tasks, and it handles the memory management well, making it stable for longer jobs.