I'm experimenting with OCR workflows on AWS to handle academic PDFs that may be scanned or have complex layouts (multi-column text, footnotes). My goal is to convert these PDFs into clean Markdown for further processing. So far I've been testing locally with Tesseract in Docker, and I recently tried OCRFlux since it supports cross-page tables and multiple languages.
Here's what I've tried so far:
1. **EC2**: A g4dn (x86) instance was straightforward for running OCRFlux with CUDA support. It's cost-effective if I only run batch jobs a few times a week and stop the instance in between, but keeping an instance up around the clock for bursty workloads feels wasteful.
2. **Lambda**: I squeezed a lightweight Tesseract build into Lambda. It worked for single-page PDFs but struggled with larger files due to the memory ceiling and the 15-minute execution timeout. The lack of GPU support also hurts performance here.
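One common workaround for the Lambda limits in option 2 is to fan a large PDF out into single-page (or few-page) chunks and process each chunk in its own invocation. A minimal sketch of just the chunking logic, in pure Python (`chunk_pages` and `pages_per_chunk` are names I made up, not from any library):

```python
def chunk_pages(page_count: int, pages_per_chunk: int = 1):
    """Split a document of `page_count` pages into (start, end) ranges,
    inclusive and 1-indexed, so each range fits one Lambda invocation."""
    if page_count < 1 or pages_per_chunk < 1:
        raise ValueError("page_count and pages_per_chunk must be >= 1")
    return [
        (start, min(start + pages_per_chunk - 1, page_count))
        for start in range(1, page_count + 1, pages_per_chunk)
    ]

# Each (start, end) tuple would become one queue message / Lambda event.
print(chunk_pages(5, 2))  # → [(1, 2), (3, 4), (5, 5)]
```

Each chunk then stays well under the timeout, and the results are stitched back together in page order afterwards.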
3. **EKS**: The most complex option but the most scalable: I containerized OCRFlux and used Kubernetes Jobs for document processing. It works well for larger batches, though GPU node costs add up quickly.
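For anyone curious what the EKS setup in option 3 looks like, here's a rough sketch of the kind of Job manifest I mean (the image name, parallelism, and completion count are placeholders for illustration; each pod pulls its document from a queue such as SQS):

```yaml
# Sketch of a GPU batch Job; resource sizes and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: ocrflux-batch
spec:
  parallelism: 4            # process four PDFs at once
  completions: 100          # one pod per queued document
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ocrflux
          image: my-registry/ocrflux:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1               # requires the NVIDIA device plugin
```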
Now I'm trying to find the best balance between cost and ease of orchestration for processing 500-1000 PDFs a month. Has anyone used AWS Batch or Fargate for this? Lambda feels too limited for the workload, while plain EC2 seems too manual. I'd also like to hear from anyone who has done OCR with Textract (or post-processed the output with Comprehend) and how well it handled layout fidelity.
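For sizing Batch versus an always-on instance at that volume, a back-of-the-envelope calculation helps. A sketch, assuming roughly 2 minutes of GPU time per PDF and an illustrative $0.50/hour rate (both numbers are my guesses; check actual throughput and current pricing):

```python
def monthly_gpu_cost(pdfs_per_month: int,
                     minutes_per_pdf: float = 2.0,
                     hourly_rate: float = 0.50) -> float:
    """Estimated monthly cost in dollars if compute only runs while
    jobs run (the AWS Batch / on-demand model)."""
    hours = pdfs_per_month * minutes_per_pdf / 60
    return hours * hourly_rate

# 1000 PDFs * 2 min ≈ 33.3 GPU-hours of actual work,
# versus 720 hours billed for the same instance left running 24/7.
print(round(monthly_gpu_cost(1000), 2))  # → 16.67
```

Under these assumptions the batch model is an order of magnitude cheaper than a persistent instance, which is why Batch with managed spot capacity is usually suggested for bursty workloads like this.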
2 Answers
Why didn’t you consider using Amazon Textract from the start? It could handle the OCR tasks without much hassle, especially for structured data.
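For reference, the synchronous Textract call returns a list of `Blocks`, and flattening the `LINE` blocks gives you plain text to convert to Markdown. Note that `detect_document_text` takes images or single-page documents; multipage PDFs need the async `start_document_text_detection` path via S3. A sketch (the commented `boto3` call is the real API; the flattening helper and sample response are mine):

```python
# Real call (needs AWS credentials):
#   import boto3
#   resp = boto3.client("textract").detect_document_text(
#       Document={"Bytes": open("page.png", "rb").read()})

def lines_from_blocks(blocks):
    """Pull the text of each LINE block, in order, from a Textract response."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]

# Hand-made sample shaped like a Textract response:
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Section 1. Introduction"},
    {"BlockType": "WORD", "Text": "Section"},
    {"BlockType": "LINE", "Text": "Academic PDFs vary widely."},
]}
print("\n".join(lines_from_blocks(sample["Blocks"])))
```

Be aware that Textract's layout handling of multi-column academic pages is something you'd want to test on your own documents before committing.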
If you're not using GPUs, check out Trigger.dev. They offer machines with 16 GB RAM and built-in queue support, which works well for single-document OCR without a GPU. Since tasks run as long-lived processes, you can also manage memory explicitly in your code.
Yeah, I was wondering the same thing. Textract might save you a lot of development time!