I'm experimenting with OCR workflows for academic PDFs on AWS and would love to hear about others' experiences. My PDFs vary: some are scanned, and others have complex layouts such as multiple columns or footnotes stuffed with text. My aim is to convert these into clean Markdown. So far, I've tried three setups on AWS: EC2 running OCRFlux in Docker (works well but feels wasteful for sporadic jobs), Lambda with Tesseract (too limited for larger documents), and EKS with GPU nodes (complicated but scalable). I'm trying to find the best balance between cost and ease of orchestration for about 500–1000 PDFs each month. Does anyone have experience with AWS Batch or Fargate for these workloads? Additionally, has anyone used Textract or similar services despite the layout concerns? I would appreciate insights, especially on running GPU-heavy tasks while keeping costs low, and experiences with OCRFlux or other cloud-based parsing solutions.
5 Answers
Have you considered Amazon Rekognition? It could offer some useful capabilities for your OCR tasks!
Have you explored Textract or Mistral OCR? Depending on the types of layouts you're dealing with, they might be worth looking into.
Have you thought about using Amazon Textract? It might simplify things for you right from the start, especially with its layout handling capabilities. Just curious why it wasn't your first choice!
If a GPU isn't a necessity, I’d recommend trying trigger.dev. They have machines with 16 GB of RAM that support queuing. We haven't needed a GPU for single-document OCR on it, and it’s been pretty efficient.
Using an open-source project that doesn't depend on a GPU could work too! You can go with Textractor, but remember it only extracts data, so you’d need to script the transformation to Markdown yourself, which can be a hassle if you're dealing with complex layouts. An open-source library that handles that conversion would save you some headaches!
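For anyone wondering what "script the transformation to Markdown yourself" might look like, here's a rough sketch, not a finished converter. It assumes you already have the `Blocks` list from a Textract response requested with the `LAYOUT` feature, that the layout blocks arrive in reading order, and that list items show up as `LAYOUT_TEXT` children of a `LAYOUT_LIST` block. The heading mapping and the function names (`layout_to_markdown`, `block_text`, `child_ids`) are my own choices for illustration. Tables, figures, and footnotes are simply ignored here.

```python
from typing import Any

Block = dict[str, Any]

# Assumed mapping from Textract layout block types to a Markdown prefix.
PREFIXES = {
    "LAYOUT_TITLE": "# ",
    "LAYOUT_SECTION_HEADER": "## ",
    "LAYOUT_TEXT": "",
}
# Running headers, footers, and page numbers are dropped from the output.
SKIPPED = {"LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"}


def child_ids(block: Block) -> list[str]:
    """Return the IDs listed under the block's CHILD relationships."""
    return [
        child_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def block_text(block: Block, by_id: dict[str, Block]) -> str:
    """Join the text of the LINE children of a layout block."""
    lines = (by_id.get(i, {}) for i in child_ids(block))
    return " ".join(b["Text"] for b in lines if b.get("BlockType") == "LINE")


def layout_to_markdown(blocks: list[Block]) -> str:
    """Convert layout blocks to Markdown, in the order they are given."""
    by_id = {b["Id"]: b for b in blocks}
    # Blocks nested inside a list are emitted as bullets, not as paragraphs.
    nested = {
        i for b in blocks if b["BlockType"] == "LAYOUT_LIST" for i in child_ids(b)
    }
    parts: list[str] = []
    for block in blocks:
        kind = block["BlockType"]
        if not kind.startswith("LAYOUT_") or kind in SKIPPED or block["Id"] in nested:
            continue
        if kind == "LAYOUT_LIST":
            items = [block_text(by_id[i], by_id) for i in child_ids(block) if i in by_id]
            bullets = "\n".join(f"- {text}" for text in items if text)
            if bullets:
                parts.append(bullets)
        elif kind in PREFIXES:
            text = block_text(block, by_id)
            if text:
                parts.append(PREFIXES[kind] + text)
    return "\n\n".join(parts) + "\n"
```

With boto3, the input would be something like `analyze_document(..., FeatureTypes=["LAYOUT"])["Blocks"]`; multi-page PDFs would go through the asynchronous API instead. Whether the reading order holds up on multi-column pages with dense footnotes is exactly the part you'd need to test on your own documents.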
I did consider Textract but wasn't sure it could handle some complex layouts I deal with.