I'm setting up OCR workflows on AWS to convert academic PDFs (some scanned, others with tricky layouts like multiple columns and footnotes) into clean Markdown for downstream processing. I started off testing locally with Tesseract and also tried OCRFlux, which handles complex layouts better. Here's what I've tried so far:
1. **EC2 (g4dn/x86)** - Works well, good for batch processing, but it's a bit wasteful to keep an instance running for my bursty tasks.
2. **Lambda (Tesseract)** - A lightweight build works fine for simple PDFs but runs into memory and timeout limits with larger documents. With no GPU available, throughput is poor.
3. **EKS with GPU Nodes** - The toughest to set up but scalable; I had to containerize OCRFlux and handle job intake manually, which adds complexity but works for larger batches.
I'm wondering about the best mix of cost and orchestration for processing about 500-1000 PDFs a month. What's your experience with Batch or Fargate for these workloads? Also, have any of you used Textract or Comprehend for OCR, considering their limitations with layout fidelity? Would love to hear how others are managing similar projects, especially balancing heavy GPU needs with cost efficiency. Any insights would be greatly appreciated!
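To make the "mix" concrete, here is a minimal sketch of how the options above could be combined: simple, short PDFs go to the Lambda/Tesseract path, and everything else goes to a GPU queue running OCRFlux. The `PdfJob` fields, the backend names, and the 20-page cutoff are all hypothetical placeholders for illustration, not measured limits.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PdfJob:
    """Hypothetical description of one PDF waiting to be processed."""
    key: str              # e.g. an S3 object key
    pages: int
    complex_layout: bool  # multi-column, footnotes, etc.


def choose_backend(job: PdfJob, max_lambda_pages: int = 20) -> str:
    """Pick a processing path for a job.

    Assumption: the 20-page default is a placeholder; the real cutoff
    depends on the Lambda memory and timeout settings in use.
    """
    # Complex layouts need OCRFlux, which in turn needs a GPU.
    if job.complex_layout:
        return "gpu-ocrflux"
    # Small, simple documents fit the lightweight Tesseract function.
    if job.pages <= max_lambda_pages:
        return "lambda-tesseract"
    # Large documents hit Lambda memory/timeout limits, so queue them too.
    return "gpu-ocrflux"
```

With a split like this, the GPU capacity only has to exist while the queue is non-empty, which is the property that matters for bursty workloads.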
5 Answers
Why not also check out Rekognition? It might offer solutions that pair well with your OCR workflows.
If you don’t need GPU power, consider using trigger.dev. They provide 16 GB RAM machines with queue support, which might fit your document processing needs without the overhead of managing GPUs.
Have you given Mistral OCR a shot in addition to Textract? It really depends on your document formats, but it could be a solid alternative for complex layouts.
Have you thought about using Amazon Textract from the start? It might solve some of your layout issues without heavy lifting. Just curious why it wasn't your first choice!
You could look into using an open-source solution that doesn't require a GPU on Lambda. Textractor might get your data, but transforming it to Markdown will need a fair bit of scripting, so be prepared for that!
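To give a sense of the scripting involved, here is a rough sketch that takes Textract-style `LINE` blocks (dicts with `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`) and joins them into Markdown paragraphs in reading order for a two-column page. The function name, the fixed `column_split` at the page midpoint, and the `para_gap` threshold are assumptions for illustration; real papers would also need handling for headings, footnotes, and words hyphenated across lines.

```python
from __future__ import annotations

from typing import Any


def lines_to_markdown(
    blocks: list[dict[str, Any]],
    column_split: float = 0.5,  # assumed: columns divide at the page midpoint
    para_gap: float = 1.0,      # assumed: a gap over 1x line height starts a paragraph
) -> str:
    """Order LINE blocks by (page, column, top) and join them into paragraphs."""
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]

    def sort_key(block: dict[str, Any]) -> tuple[int, int, float]:
        box = block["Geometry"]["BoundingBox"]
        column = 0 if box["Left"] < column_split else 1
        return (block.get("Page", 1), column, box["Top"])

    paragraphs: list[str] = []
    current: list[str] = []
    prev_key = None
    prev_box = None
    for block in sorted(lines, key=sort_key):
        key = sort_key(block)
        box = block["Geometry"]["BoundingBox"]
        if prev_key is not None:
            same_column = key[:2] == prev_key[:2]
            gap = box["Top"] - (prev_box["Top"] + prev_box["Height"])
            # Break on a page/column change or a large vertical gap.
            if not same_column or gap > para_gap * prev_box["Height"]:
                paragraphs.append(" ".join(current))
                current = []
        current.append(block["Text"])
        prev_key, prev_box = key, box
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

The sort puts the whole left column before the right one on each page, which is the part a naive top-to-bottom ordering gets wrong on multi-column papers.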
Yeah, I was wondering the same thing! Textract could simplify things.