I'm setting up OCR workflows on AWS for processing academic PDFs, which come with challenges like poor layouts, footnotes, and the occasional formula. My ultimate goal is to convert these documents into clean Markdown for easier downstream processing. I've started testing locally with Tesseract (in Docker) and have also looked into OCRFlux for its handling of cross-page tables and multilingual content.
I've tried a few AWS setups:
1. **EC2 (g4dn/x86 instance)**: This was quite straightforward, and I managed to run OCRFlux effectively with CUDA support. It's cost-efficient for batch jobs a few times a week, but keeping the instance running between infrequent jobs feels wasteful.
2. **Lambda (with custom layers and Tesseract)**: It worked okay for single-page PDFs, but ran into memory and timeout limits on larger documents. It also lacks GPU support, which hurts performance.
3. **EKS with GPU nodes**: The setup was complex, but it's the most scalable option. I managed to containerize OCRFlux and implemented a controller for document intake. It handles batch processing well, but costs can rise depending on GPU allocations and how many nodes are active.
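To make the "always-on vs. on-demand" trade-off from the EC2 and EKS options concrete, here is a rough back-of-the-envelope sketch. The function name, the per-PDF processing time, and the hourly rate are all placeholders I made up for illustration, not real AWS prices or measured throughput; plug in your own numbers.

```python
def monthly_gpu_cost(pdfs_per_month, seconds_per_pdf, hourly_rate_usd,
                     always_on=False, hours_in_month=730.0):
    """Estimate monthly GPU instance cost in USD.

    always_on=True  -> billed for every hour of the month.
    always_on=False -> billed only while PDFs are being processed
                       (ignores startup time and storage, for simplicity).
    """
    if always_on:
        return hours_in_month * hourly_rate_usd
    busy_hours = pdfs_per_month * seconds_per_pdf / 3600.0
    return busy_hours * hourly_rate_usd


if __name__ == "__main__":
    # Placeholder inputs: 1000 PDFs, 36 s each, 1.00 USD/hour (not a real price).
    print(monthly_gpu_cost(1000, 36, 1.0))                  # busy hours only
    print(monthly_gpu_cost(1000, 36, 1.0, always_on=True))  # instance never stopped
```

With those placeholder inputs the on-demand case is 10 busy hours versus 730 billed hours for the always-on case, which is why stopping the instance (or scaling nodes to zero) between batches matters at this volume.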
Now I'm trying to find the best balance between cost and orchestration effort for processing 500-1000 PDFs a month. Lambda seems too limited, while EC2 feels too manual for queued jobs, so I'm curious whether others have used AWS Batch or Fargate for similar workloads. I'm also wondering whether tools like Textract or Comprehend could serve my needs, although I'm skeptical about their layout fidelity. I'd love to know how others manage similar OCR and document parsing tasks on AWS, especially when it comes to balancing GPU needs against cost. And if anyone has experience with OCRFlux or comparable tools in the cloud, I'd really appreciate your insights!
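For reference, this is roughly what I imagine the AWS Batch route would look like: one job per PDF, submitted with boto3's `submit_job`. It is only a sketch. The queue name `ocr-gpu-queue`, the job definition `ocrflux-job`, and the `INPUT_S3_URI` environment variable are hypothetical names, and it assumes a GPU compute environment and a container image that reads that variable already exist.

```python
import re


def build_job_request(bucket, key,
                      job_queue="ocr-gpu-queue",      # hypothetical queue name
                      job_definition="ocrflux-job"):  # hypothetical job definition
    """Build the keyword arguments for Batch submit_job for one PDF."""
    # Batch job names allow only letters, digits, hyphens, and underscores.
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
    return {
        "jobName": f"ocr-{safe}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # The container is assumed to read its input location from this variable.
            "environment": [
                {"name": "INPUT_S3_URI", "value": f"s3://{bucket}/{key}"},
            ],
        },
    }


def submit(bucket, key):
    """Submit one OCR job and return its job ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder can be tested offline

    batch = boto3.client("batch")
    return batch.submit_job(**build_job_request(bucket, key))["jobId"]
```

The appeal, as far as I understand it, is that a managed compute environment with a minimum of zero vCPUs can scale down to nothing when the queue is empty, so GPU time would only be billed while jobs run.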
5 Answers
Why not take an open-source library that doesn't require a GPU and run it on Lambda? You can use Textract for text extraction, but converting its output to Markdown takes extra scripting, which can get tricky. There are libraries that handle that conversion right out of the box!
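To give a sense of the scripting involved, here is a minimal sketch of turning a Textract response into Markdown. It assumes the document was analyzed with the `LAYOUT` feature type, so the response contains `LAYOUT_*` blocks whose `CHILD` relationships point at `LINE` blocks. The function name and the heading mapping are my own choices, and lists, tables, figures, and footnotes are deliberately ignored, which is exactly where it gets tricky.

```python
# Map Textract layout block types to Markdown prefixes (my own mapping).
PREFIX = {
    "LAYOUT_TITLE": "# ",
    "LAYOUT_SECTION_HEADER": "## ",
    "LAYOUT_TEXT": "",
}


def textract_to_markdown(response):
    """Render title, section header, and text layout blocks as Markdown."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    parts = []
    for block in blocks:
        prefix = PREFIX.get(block.get("BlockType"))
        if prefix is None:
            continue  # skip headers, footers, page numbers, tables, and so on
        # Collect the IDs of this layout block's children.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for child_id in rel.get("Ids", [])
        ]
        # Keep only LINE children and join their text into one paragraph.
        lines = [
            by_id[cid]["Text"]
            for cid in child_ids
            if cid in by_id and by_id[cid].get("BlockType") == "LINE"
        ]
        if lines:
            parts.append(prefix + " ".join(lines))
    if not parts:
        return ""
    return "\n\n".join(parts) + "\n"
```

You would call it on the dictionary returned by the analysis call (or on the merged pages of an asynchronous job) and write the result to a `.md` file.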
Have you considered using Amazon Textract from the start? It might simplify some of the challenges you're facing with OCR and layout parsing, especially since it's designed for document structures.
If GPU isn't essential for your workloads, you might want to check out trigger.dev. They provide 16 GB RAM machines with queue support which should handle your document processing without needing GPU resources. It might help manage memory issues.
Have you given Mistral OCR a shot? It might handle various layouts better, depending on what formats you’re working with!
Have you thought about Amazon Rekognition for this task? It might be a good fit depending on your specific needs and layouts.
I was thinking the same! Textract seems like a solid choice for accurate text extraction.