Best AWS Options for Running PDF OCR Workloads?

Asked By Aaron Garnes On

I'm setting up OCR workflows on AWS for academic PDFs. Some are scanned, and others have tricky layouts such as multiple columns, footnotes, and formulas. I want to convert them into clean Markdown for downstream processing. I've been testing locally with Tesseract in Docker, and I also gave OCRFlux a shot since it can handle complex layouts and multilingual content.

So far, I've experimented with three options:
1. **EC2 (g4dn/x86 instance)**: This worked fine for running OCRFlux with CUDA support, but it feels wasteful for a bursty task, even though costs are manageable when I spin down the instance after use.

2. **Lambda (with Tesseract)**: I tried putting a lightweight build of Tesseract in Lambda using custom layers. It works decently for single-page PDFs but struggles with larger documents because of memory and timeout limits. There's also no GPU, which really hurts performance on heavier jobs.

3. **EKS with GPU nodes**: Setting this up was a bit of a pain, but it's scalable. I containerized OCRFlux and built a controller for document intake. It performs well for batch processing, but costs can creep up depending on node and GPU usage.
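One possible workaround for the Lambda memory and timeout limits in option 2 is to fan a large PDF out into fixed-size page ranges, so each invocation only handles a bounded slice. Here's a minimal sketch of just the range-splitting step. The function name `page_chunks` and the chunk size of 10 are illustrative assumptions, not something I've tuned or deployed.

```python
def page_chunks(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges.

    Hypothetical helper: each range would be handed to one OCR
    invocation so that no single call processes the whole document.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


# A 23-page document split into slices of at most 10 pages.
print(page_chunks(23, 10))  # [(1, 10), (11, 20), (21, 23)]
```

Each range could then be queued as a separate job, with the per-page Markdown stitched back together in page order afterwards.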

I'm still weighing my options:
- For roughly 500-1000 PDFs a month, what's the best balance of cost and orchestration ease?
- Does anyone have experience with Batch or Fargate for these workloads? Lambda feels limiting, while EC2 is too manual for my needs.
- Has anyone used Textract or Comprehend for OCR? My main concern there is layout fidelity.
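To frame the volume question, here's a back-of-the-envelope sketch of how monthly GPU time scales with document count. Every input is a placeholder assumption: the page count, the seconds per page, and the hourly rate are made up for illustration and are not measured OCRFlux throughput or actual AWS prices.

```python
def monthly_gpu_cost(pdfs_per_month, avg_pages, seconds_per_page,
                     hourly_rate_usd, overhead_seconds_per_pdf=0.0):
    """Return (gpu_hours, cost_usd) for a month of OCR work.

    All arguments are caller-supplied estimates; nothing here is
    looked up from AWS pricing.
    """
    seconds_per_pdf = avg_pages * seconds_per_page + overhead_seconds_per_pdf
    gpu_hours = pdfs_per_month * seconds_per_pdf / 3600
    return gpu_hours, gpu_hours * hourly_rate_usd


# Placeholder inputs: 1000 PDFs, 20 pages each, 3 s/page, $1.00/hour.
hours, cost = monthly_gpu_cost(1000, 20, 3, 1.00)
print(f"{hours:.1f} GPU-hours, ${cost:.2f}")  # 16.7 GPU-hours, $16.67
```

With inputs like these, the busy time is a small fraction of a month, which is why idle capacity rather than compute itself is what I'm trying to minimize.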

I'd love to hear about other experiences running document parsing or OCR tasks on AWS, especially balancing GPU-heavy work against cost efficiency. Anyone else using OCRFlux or similar tools? How are you deploying them?

