Hey everyone! I'm starting to set up OCR workflows on AWS for academic PDFs, which include a mix of scanned pages and documents with messy layouts (think multi-column, footnotes, and occasional formulas). My aim is to convert these into clean Markdown for further processing. I've been testing locally with Tesseract in Docker and recently switched to OCRFlux for its ability to handle complex layouts and multilingual text.
Here's what I've tried so far:
1. **EC2** (g4dn/x86 instance): This is pretty straightforward and allows me to run OCRFlux using Docker with CUDA support. It's manageable cost-wise for batch jobs a few times a week, but it seems wasteful to have an instance running for bursty tasks.
2. **Lambda** (with layers + Tesseract): I tried a lightweight build of Tesseract in Lambda, which works fine for single-page PDFs and basic forms, but the memory and timeout limits are a hassle for larger documents. Plus, there's no GPU support, so performance is limited.
3. **EKS with GPU nodes**: Setting this up was the most complex, but it's also the most scalable. I containerized OCRFlux and created a controller for document intake and output to S3, using k8s Jobs for processing. While great for batching multiple PDFs, costs can escalate depending on GPU allocation.
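For context, the Job-per-PDF pattern from option 3 can be sketched roughly as below. This is only an illustration, not the actual controller: the image name, the `INPUT_URI`/`OUTPUT_URI` environment variables, and the `markdown/` output prefix are all hypothetical, and it assumes the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource.

```python
import hashlib
import json


def build_ocr_job(bucket: str, key: str,
                  image: str = "example.com/ocrflux-worker:latest") -> dict:
    """Return a Kubernetes batch/v1 Job manifest that processes one PDF.

    The image and the environment variable names are placeholders.
    """
    # Job names must be DNS-safe, so derive a short, stable suffix from the S3 path.
    suffix = hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest()[:10]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"ocr-{suffix}"},
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ocr",
                        "image": image,
                        "env": [
                            {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
                            {"name": "OUTPUT_URI",
                             "value": f"s3://{bucket}/markdown/{key}.md"},
                        ],
                        # One GPU per Job; requires the NVIDIA device plugin.
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ocr_job("my-bucket", "papers/example.pdf"), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. Hashing the S3 path keeps the Job name stable, so resubmitting the same PDF collides with the existing Job instead of silently creating a duplicate.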
I'm still trying to figure out:
- For a volume of around 500-1000 PDFs a month, what's the best way to balance cost with ease of orchestration?
- Has anyone here used Batch or Fargate for similar workloads? It feels like Lambda is pretty limited while EC2 is too manual for a queued workflow.
- Also, has anyone considered using Textract or Comprehend for OCR? They seem to fall short for the kind of layout fidelity I'm looking for.
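For reference, a minimal sketch of what the Batch option could look like with one job per PDF. The queue name `ocr-gpu-queue`, the job definition `ocrflux-worker`, and the `INPUT_URI` variable are assumptions made for illustration; the queue, the job definition, and a GPU compute environment would need to exist already.

```python
import re


def build_batch_request(bucket: str, key: str,
                        job_queue: str = "ocr-gpu-queue",
                        job_definition: str = "ocrflux-worker") -> dict:
    """Build keyword arguments for the Batch SubmitJob call (names are placeholders)."""
    # Batch job names allow letters, digits, hyphens and underscores (max 128 chars).
    safe_key = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
    return {
        "jobName": f"ocr-{safe_key}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
            ]
        },
    }


def submit(request: dict) -> str:
    """Submit the request and return the Batch job ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed
    return boto3.client("batch").submit_job(**request)["jobId"]
```

With a managed compute environment whose minimum vCPU count is 0, Batch scales the GPU instances down when the queue is empty, which may fit a bursty 500-1000 PDFs a month better than an always-on instance.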
I'd love to hear your experiences if you've tackled similar document parsing or OCR workflows on AWS, especially regarding how to manage GPU-heavy tasks while keeping costs down. Also, if anyone has tried OCRFlux or other modern parsers in the cloud, I'd be keen to know how you deployed them!
5 Answers
Hey, why didn’t you consider using Amazon Textract right off the bat? It might handle your needs pretty well given its capabilities with structured documents.
You could look for an open-source project that doesn’t require a GPU and run it on Lambda. Textractor can get the information out, but the conversion to Markdown will need a script, and trust me, that’s not a simple task. An open-source library that handles this transformation would be a better option.
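To show why the Markdown step is harder than it sounds, here is a deliberately naive sketch that works on the raw Textract response JSON (`Blocks` entries with `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`). The function name and the `## Page N` heading convention are made up for this example, and the reading order is a simple top-to-bottom, left-to-right sort.

```python
def textract_lines_to_markdown(response: dict) -> str:
    """Join Textract LINE blocks into Markdown using a naive reading order."""
    lines = [b for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]
    # Sort by page, then vertical position (rounded to group a row), then horizontal.
    lines.sort(key=lambda b: (
        b.get("Page", 1),
        round(b["Geometry"]["BoundingBox"]["Top"], 2),
        b["Geometry"]["BoundingBox"]["Left"],
    ))
    out = []
    current_page = None
    for block in lines:
        page = block.get("Page", 1)
        if page != current_page:
            if current_page is not None:
                out.append("")  # blank line before the next page heading
            out.append(f"## Page {page}")
            out.append("")
            current_page = page
        out.append(block["Text"])
    return "\n".join(out)
```

On a two-column page this interleaves the left and right columns line by line, and footnotes, tables, and formulas get no special treatment. That gap is the part a dedicated layout-aware library has to fill.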
What about using Rekognition? It might be another tool worth considering for your OCR needs.
If you don’t necessarily need GPU support, you should check out trigger.dev. They offer machines with 16 GB of RAM and queue support, which could work well for your single document OCR tasks without needing any heavy GPU setup.
Have you given Textract or Mistral OCR a try? I’m not sure how varied your layouts are, but those might be useful options to explore.
Yeah, I was wondering the same thing!