I'm setting up OCR workflows on AWS for academic PDFs. Some are scanned, and others have tricky layouts: multiple columns, footnotes, and formulas. I want to convert them into clean Markdown for downstream processing. I've been testing locally with Tesseract in Docker, and I also gave OCRFlux a shot since it can handle complex layouts and multilingual content.
So far, I've experimented with three options:
1. **EC2 (g4dn/x86 instance)**: This worked fine for running OCRFlux with CUDA support but feels wasteful for what's a bursty task, even though costs are manageable when I spin down the instance after use.
2. **Lambda (with Tesseract)**: I tried putting a lightweight build of Tesseract in Lambda using custom layers. It works decently for single-page PDFs but struggles with larger documents because of memory and timeout limits. Plus, there's no GPU, which really hurts performance on heavier jobs.
3. **EKS with GPU nodes**: Setting this up was a bit of a pain but it's scalable. I containerized OCRFlux and built a controller for document intake. It performs well for batching but costs can sneak up depending on node and GPU usage.
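One workaround I've considered for the Lambda memory and timeout limits in option 2 is splitting each PDF into small page ranges and running OCR on each range in its own invocation. A minimal sketch of just the splitting step; `chunk_pages` and the chunk size are hypothetical names and values, not something I have running in production:

```python
from typing import List, Tuple


def chunk_pages(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first, last) page ranges covering a document."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")

    ranges = []
    # Step through the document in fixed-size windows; the last one may be shorter.
    for first in range(1, total_pages + 1, pages_per_chunk):
        last = min(first + pages_per_chunk - 1, total_pages)
        ranges.append((first, last))
    return ranges


if __name__ == "__main__":
    # A 10-page paper split into chunks of at most 4 pages.
    print(chunk_pages(10, 4))
```

Each range would then become one invocation payload, which keeps per-invocation memory and run time bounded by the chunk size rather than the document length. It does nothing about the missing GPU.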
I'm still weighing my options:
- For roughly 500-1000 PDFs a month, what's the best balance of cost and orchestration ease?
- Has anyone had experiences with Batch or Fargate for these workloads? Lambda feels limiting while EC2 is too manual for my needs.
- Also, has anyone used Textract or Comprehend for OCR? I'm concerned about layout fidelity.
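To make the Batch question concrete, this is roughly the per-document submission I have in mind: one GPU job per PDF stored in S3. It is an untested sketch. The queue name, job definition name, bucket, and the `INPUT_BUCKET`/`INPUT_KEY` environment variables are all placeholders I made up, and the container would need to read them itself.

```python
import re
from typing import Any, Dict


def build_batch_request(bucket: str, key: str,
                        job_queue: str, job_definition: str) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Batch client's submit_job call."""
    # Batch job names allow only letters, digits, hyphens, and underscores
    # (max 128 characters), so replace everything else in the S3 key.
    safe_key = re.sub(r"[^A-Za-z0-9_-]", "-", key)
    job_name = ("ocr-" + safe_key)[:128]

    return {
        "jobName": job_name,
        "jobQueue": job_queue,            # placeholder queue name
        "jobDefinition": job_definition,  # placeholder job definition name
        "containerOverrides": {
            # Hypothetical variables the OCR container would read at startup.
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ],
            # Request one GPU for the job.
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }


if __name__ == "__main__":
    request = build_batch_request(
        "my-papers-bucket", "papers/2024/a b.pdf", "ocr-gpu-queue", "ocrflux-job"
    )
    print(request["jobName"])
    # With AWS credentials configured, the submission itself would be:
    #   import boto3
    #   boto3.client("batch").submit_job(**request)
```

The appeal over my EKS setup is that the compute environment could scale to zero between bursts, so I would only pay for GPU time while jobs are running. I have not verified how well that works in practice.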
I'd love to hear about other experiences running document parsing or OCR tasks on AWS, especially balancing GPU-heavy work against cost efficiency. Anyone else using OCRFlux or similar tools? How are you deploying them?