I'm exploring how to set up OCR workflows on AWS for processing academic PDFs. These PDFs vary in quality: some are scanned, and others have difficult layouts such as multiple columns and dense footnotes. My ultimate goal is to convert these documents into clean Markdown for easier downstream processing. I started with some local testing using Tesseract in a Docker container, and recently I experimented with OCRFlux, which is good at handling complex content like cross-page tables and multilingual text.
Here's a quick rundown of what I've tried so far:
1. **EC2 (g4dn/x86 instance)**: This was easy to set up and worked well with OCRFlux using CUDA for performance. It's a cost-effective option for running batch jobs a few times weekly, but having a constantly running instance seems inefficient for my sporadic workload.
2. **Lambda (using layers with Tesseract)**: I attempted to load a lighter version of Tesseract into Lambda. It's decent for processing single-page PDFs but runs into memory limits and timeouts on larger documents. Lambda also doesn't support GPUs, which limits performance.
3. **EKS with GPU nodes**: Setup for this was quite intricate, but it proved to be the most scalable solution. I managed to containerize OCRFlux, set up a small controller for document processing, and pushed outputs to S3 with k8s Jobs. While effective, it can become costly as I need to keep more nodes and GPUs running for larger batches.
I'm currently trying to figure out the best balance of cost and orchestration ease for about 500 to 1000 PDFs each month. Have any of you tried using AWS Batch or Fargate for workloads like this? Lambda seems too limited, while EC2 feels overly manual for my needs. Also, has anyone used Amazon Textract or Comprehend for this? I doubt they'd meet my layout requirements. I'd appreciate any insights from those who have experience with similar document parsing and OCR workloads on AWS, especially regarding balancing GPU-heavy processing with cost efficiency.
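For reference, here is a minimal sketch of what the AWS Batch option might look like: one job per PDF, with the S3 location passed to the container as environment variables. The queue name `ocr-gpu-queue`, the job definition name `ocrflux-job`, and the `INPUT_BUCKET`/`INPUT_KEY` variables are all hypothetical placeholders, and the sketch assumes a GPU compute environment already exists (ideally with its minimum vCPUs set to 0 so it scales down between batches).

```python
import re


def build_batch_job_request(bucket, key,
                            job_queue="ocr-gpu-queue",      # hypothetical queue name
                            job_definition="ocrflux-job"):  # hypothetical job definition
    """Build the keyword arguments for boto3's batch.submit_job for one PDF in S3."""
    # Batch job names may only contain letters, digits, hyphens, and underscores,
    # so replace every other character in the S3 key and cap the length.
    safe_key = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
    return {
        "jobName": f"ocr-{safe_key}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # The container is assumed to read these variables to locate its input.
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ],
            # Request one GPU per job.
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }


def submit_all(bucket, keys):
    """Submit one Batch job per PDF key and return the job IDs."""
    import boto3  # imported lazily so the request builder can be tested offline

    batch = boto3.client("batch")
    return [
        batch.submit_job(**build_batch_job_request(bucket, key))["jobId"]
        for key in keys
    ]
```

Calling `submit_all("my-bucket", ["papers/a.pdf", "papers/b.pdf"])` would queue two jobs. Keeping the request builder separate from the AWS call makes it easy to check what would be submitted before spending money on GPU time.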
5 Answers
You could try an open-source project that doesn't need a GPU and can run on Lambda, like Textractor. It captures the information, but keep in mind that converting everything to Markdown takes some custom scripting, which can get really tricky.
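To give a rough idea of the custom scripting involved, here is a minimal sketch that assumes the output resembles Amazon Textract's layout blocks (`LAYOUT_TITLE`, `LAYOUT_SECTION_HEADER`, `LAYOUT_TEXT`, and so on, each pointing at its child `LINE` blocks). The function name and the heading mapping are my own choices, not part of any library, and tables, figures, and lists are ignored entirely, which is where most of the tricky work would be.

```python
def layout_blocks_to_markdown(blocks):
    """Convert a list of Textract-style block dicts into a rough Markdown string."""
    by_id = {block["Id"]: block for block in blocks}
    # Assumed mapping from layout block type to Markdown heading prefix.
    prefixes = {"LAYOUT_TITLE": "# ", "LAYOUT_SECTION_HEADER": "## "}
    # Running headers, footers, and page numbers are dropped as noise.
    skipped = {"LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"}

    paragraphs = []
    for block in blocks:
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_") or block_type in skipped:
            continue
        # Collect the IDs of this layout block's children.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        # Keep only children that are text lines.
        lines = [
            by_id[child_id]["Text"]
            for child_id in child_ids
            if by_id.get(child_id, {}).get("BlockType") == "LINE"
        ]
        if lines:
            paragraphs.append(prefixes.get(block_type, "") + " ".join(lines))
    return "\n\n".join(paragraphs)
```

Even this simple version has to decide what counts as a heading and what to throw away. Multi-column reading order, footnotes, and cross-page tables would each need their own handling on top of it.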
Why didn't you consider using Amazon Textract from the start? It might save you some hassle compared to the other methods you've chosen.
Have you looked into alternatives like Textract or Mistral OCR? I'm not sure how varied your document layouts are, but they could be worth investigating.
Have you considered Amazon Rekognition? It might be another tool to bring into the mix for your OCR needs.
If GPU isn't strictly necessary for your tasks, you might want to explore trigger.dev. They offer machines with 16 GB of RAM and built-in queue support, which works well for single-document OCR without needing GPU.
Yeah, I'm curious about that too!