I'm experimenting with OCR workflows for academic PDFs on AWS and would love to learn from others' experiences. Some of these PDFs are scanned, while others have complex layouts with multiple columns, footnotes, and occasional formulas. My goal is to transform them into clean Markdown for easier processing. I've been testing locally with Tesseract in Docker and recently explored OCRFlux due to its ability to handle cross-page tables and multilingual data.
Here's what I've tried so far:
1. **EC2 (g4dn/x86 instance)**: This setup was pretty straightforward and worked fine with OCRFlux. The cost is manageable for running batch jobs a few times a week since I can stop the instance afterward. However, dedicating an instance to a task with such sporadic demand still feels a bit wasteful.
2. **Lambda (with Tesseract)**: I attempted to use a lightweight build of Tesseract in Lambda through custom layers. It worked okay for single-page PDFs, but the memory and timeout limits (at most 10 GB and 15 minutes per invocation) are frustrating when handling larger documents or heavy post-processing. Plus, without GPU support, performance isn't great.
3. **EKS with GPU nodes**: This was the trickiest option to set up but proved to be the most scalable. I containerized OCRFlux and developed a small controller to manage document intake and output to S3. Kicking off jobs via Kubernetes Jobs works well for batching dozens of PDFs, but costs can skyrocket depending on how many nodes and GPU resources I maintain.
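For anyone unfamiliar with the Job-per-batch pattern from option 3, here is a minimal sketch of what generating one Kubernetes Job per batch of PDFs could look like. It is not the actual controller described above: the image name, the `INPUT_BUCKET` and `INPUT_KEYS` environment variables, and the batch size are all hypothetical, and it assumes the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource.

```python
import json


def chunk(keys, size):
    """Split a list of S3 keys into batches of at most `size` items."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def build_ocr_job(batch_id, bucket, keys, image="example.com/ocrflux-worker:latest"):
    """Build a batch/v1 Job manifest that processes one batch of PDFs on one GPU.

    The image and env var names are placeholders; the worker container is
    assumed to read them and write its Markdown output back to S3.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"ocr-batch-{batch_id}"},
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ocr",
                        "image": image,
                        "env": [
                            {"name": "INPUT_BUCKET", "value": bucket},
                            {"name": "INPUT_KEYS", "value": json.dumps(keys)},
                        ],
                        # One GPU per Job; requires the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    pdf_keys = [f"incoming/paper-{i}.pdf" for i in range(5)]
    for n, batch in enumerate(chunk(pdf_keys, 2)):
        # kubectl accepts JSON manifests, e.g. piped into `kubectl apply -f -`.
        print(json.dumps(build_ocr_job(n, "my-ocr-bucket", batch)))
```

Keeping the batch size small means each Job finishes quickly, which helps if the GPU node group is allowed to scale down between runs.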
I'm still figuring out a few things:
- For smaller volumes (like 500-1000 PDFs a month), what's the best trade-off between cost and orchestration ease?
- Has anyone used Batch or Fargate for similar workloads? Lambda seems restricted, while EC2 feels too "manual" for a job queue.
- I'm also curious if anyone has offloaded the OCR process to services like Textract or Comprehend, though I'm skeptical about their layout fidelity for my needs.
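As a point of reference for the Batch question, this is a rough sketch of submitting an OCR run as an AWS Batch job with boto3. The queue name, job definition name, and the `INPUT_BUCKET` and `MANIFEST_KEY` environment variables are hypothetical, and it assumes a GPU-backed compute environment already exists (one whose minimum vCPU count is 0 should scale down to nothing between runs).

```python
def build_batch_submission(job_name, job_queue, job_definition,
                           bucket, manifest_key, num_chunks):
    """Build keyword arguments for the Batch `submit_job` call.

    `manifest_key` is assumed to point at a file listing the PDFs to process.
    When there are several chunks, an array job is used so each child can
    pick its chunk from the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
    """
    kwargs = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "MANIFEST_KEY", "value": manifest_key},
            ],
            # Request one GPU for the container (the value is a string).
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }
    # Array jobs need a size of at least 2; a single chunk runs as a plain job.
    if num_chunks >= 2:
        kwargs["arrayProperties"] = {"size": num_chunks}
    return kwargs


def submit(kwargs):
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works offline
    return boto3.client("batch").submit_job(**kwargs)["jobId"]


if __name__ == "__main__":
    params = build_batch_submission(
        "ocr-run", "gpu-queue", "ocrflux-job",
        "my-ocr-bucket", "manifests/run-1.json", num_chunks=4,
    )
    print(params)  # call submit(params) once the queue and definition exist
```

Compared with the EC2 option, the appeal is that the queue and instance lifecycle are handled by the service, so the "manual" part reduces to one API call per run.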
If anyone has dealt with similar OCR workloads on AWS, I'd love to hear how you've approached it, particularly if you've balanced GPU-intensive processing with cost effectiveness. Also, if you've tested OCRFlux or similar modern parsers and how you set them up in the cloud, please share your insights!
5 Answers
Have you looked into Amazon Rekognition? It can detect text in images, which might cover part of what you need.
Have you considered using Amazon Textract from the get-go? It's tailored for document processing and could save you setup time on OCR workflows.
If a GPU isn't absolutely necessary, you might want to check out trigger.dev. They offer 16 GB RAM machines with queue support, which could work well for single-document OCR jobs that don't need a GPU.
You could try an open-source project that doesn’t require a GPU and run it on Lambda. Textract can extract information, but converting that to Markdown requires additional scripting, which can be tricky. Using a library that already handles the transformation would streamline the process.
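To illustrate the "additional scripting" point, here is a deliberately naive sketch that turns a Textract response into Markdown. It assumes the response dictionary has the usual `Blocks` list with `LINE` blocks carrying `Text`, `Page`, and `Geometry.BoundingBox`; the function name and the per-page heading are my own choices. It sorts lines top to bottom, which will interleave columns on multi-column pages, so treat it as a starting point rather than a solution.

```python
from collections import defaultdict


def textract_lines_to_markdown(response):
    """Convert LINE blocks from a Textract response into simple Markdown.

    Lines are grouped by page and ordered by their bounding-box position
    (top first, then left). Tables, formulas, and column order are ignored.
    """
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        # Single-page calls may omit "Page", so default to page 1.
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))

    parts = []
    for page in sorted(pages):
        parts.append(f"## Page {page}")
        parts.append("\n".join(text for _, _, text in sorted(pages[page])))
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    def line(text, page, top, left=0.1):
        """Build a fake LINE block for demonstration purposes."""
        return {"BlockType": "LINE", "Text": text, "Page": page,
                "Geometry": {"BoundingBox": {"Top": top, "Left": left,
                                             "Width": 0.8, "Height": 0.02}}}

    fake_response = {"Blocks": [line("Body text", 1, 0.30),
                                line("A Title", 1, 0.05),
                                line("Second page", 2, 0.10)]}
    print(textract_lines_to_markdown(fake_response))
```

Handling two-column layouts would need extra logic, for example splitting lines by their `Left` coordinate before sorting, and that is where the scripting gets tricky.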
Have you given Textract or Mistral OCR a shot? How well they work may depend on the types of layouts and formats you're dealing with, but both are worth considering.
I second that! Textract might simplify things for you.