What’s the Best Way to Run OCR Workloads on AWS? EC2, EKS, or Lambda?

Asked By CuriousUser42 On

I'm experimenting with setting up OCR workflows on AWS, specifically for academic PDFs. Some of these PDFs are scanned or have complex layouts, like multi-column text and footnotes, plus occasional formulas. My goal is to convert the content into clean Markdown for downstream processing. I've been testing locally with Tesseract via Docker and OCRFlux, which handles cross-page tables and multilingual text.

Here's what I've tried so far:
1. **EC2 (g4dn, x86 instance)**: It's straightforward and runs OCRFlux well. I set up Docker with CUDA support on the instance. The cost is manageable for batch jobs a few times a week as long as I stop the instance afterward, but leaving one running for sporadic tasks feels wasteful.

2. **Lambda (with Tesseract)**: I set up a lightweight Tesseract build on Lambda using custom layers. It works for single-page PDFs and simpler forms, but I hit memory limits and timeouts on larger documents that require heavy post-processing. Lambda also has no GPU support, so processing is slower.

3. **EKS with GPU nodes**: This setup was the trickiest, but it's also the most scalable. I containerized OCRFlux and created a small controller for document intake, sending outputs to S3. This works well for batching a bunch of PDFs, but I need to manage costs since having multiple nodes and GPU resources can add up.
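
For the memory and timeout problem in item 2, one common workaround is to split each document into page ranges and hand every range to its own invocation or queue message. Below is a minimal sketch of that fan-out step, assuming the page count is already known. The function names, the job fields, and the default batch size are all hypothetical, and splitting by page would need an extra stitching step for tables that span pages.

```python
from typing import Dict, Iterator, List, Tuple, Union


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def build_jobs(
    bucket: str, key: str, total_pages: int, batch_size: int = 10
) -> List[Dict[str, Union[str, int]]]:
    """Build one job message per page range (field names are illustrative only)."""
    return [
        {"bucket": bucket, "key": key, "first_page": first, "last_page": last}
        for first, last in page_batches(total_pages, batch_size)
    ]
```

Each job dict could be sent to a queue as-is, and the worker would OCR only the pages in its range, which keeps per-invocation memory and run time bounded.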

Currently, I'm trying to figure out a few things:
- For small volumes (around 500-1000 PDFs per month), what's the best trade-off between cost and orchestration ease?
- Has anyone used AWS Batch or Fargate for workloads like this? I find Lambda limited, but EC2 seems a bit too "manual" for the queued job approach.
- I'm also curious if anyone has used services like Textract or Comprehend for OCR, even though it seems they might not provide the layout fidelity I need.
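
To compare the options at this volume, it helps to put rough numbers on the compute time. The sketch below is only back-of-the-envelope arithmetic: every input (seconds per PDF, hourly rate, startup overhead) is something you would have to measure or look up yourself, and the function name is hypothetical. It ignores storage, data transfer, and any per-request charges.

```python
def monthly_compute_cost(
    pdfs_per_month: int,
    seconds_per_pdf: float,
    hourly_rate_usd: float,
    startup_overhead_seconds: float = 0.0,
    batches_per_month: int = 1,
) -> float:
    """Estimate the monthly cost of an instance that runs only while processing.

    Billed time = per-PDF processing time plus a fixed startup overhead
    (boot, image pull, model load) paid once per batch run.
    """
    busy_seconds = (
        pdfs_per_month * seconds_per_pdf
        + batches_per_month * startup_overhead_seconds
    )
    return busy_seconds / 3600.0 * hourly_rate_usd


# Placeholder inputs for illustration, not measured values or quoted AWS prices.
estimate = monthly_compute_cost(
    pdfs_per_month=1000,
    seconds_per_pdf=30.0,
    hourly_rate_usd=1.0,
    startup_overhead_seconds=300.0,
    batches_per_month=4,
)
print(f"Estimated compute cost: ${estimate:.2f} per month")
```

Running the same function with the rate of an always-on instance (seconds in a month instead of busy seconds) gives the other end of the range, which makes the cost of idle time explicit.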

If you've run similar document parsing or OCR workloads on AWS, I'd love to hear your approach, especially regarding GPU usage and cost optimization. Also, if you've tested OCRFlux or other modern parsers in the cloud, I'd be interested in your deployment experiences.

5 Answers

Answered By StreamlinedOps21 On

If GPUs aren’t a necessity, check out trigger.dev. They offer 16 GB RAM machines with queue support that work well for document OCR without needing a GPU. Since it's a long-running process, you can optimize your code to avoid memory issues.

Answered By MarkdownMaven77 On

Look for open-source projects that don't require GPUs and can be run on Lambda. Textractor might pull in the data, but transforming it to Markdown will need a separate script, which can be quite challenging. An open-source library could save you time!
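
To give a sense of what that separate script involves, here is a minimal sketch that turns a raw Textract response dict into Markdown, one section per page. It assumes the standard response shape (`Blocks` entries with `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`). The function name is hypothetical, and the naive top-to-bottom sort will interleave columns on multi-column pages, which is exactly the hard part for academic PDFs.

```python
from collections import defaultdict
from typing import Any, Dict, List


def textract_lines_to_markdown(response: Dict[str, Any]) -> str:
    """Convert LINE blocks from a Textract response into simple Markdown."""
    pages: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # "Page" can be missing on single-page responses, so default to 1.
            pages[block.get("Page", 1)].append(block)

    sections: List[str] = []
    for page_number in sorted(pages):
        # Naive reading order: top-to-bottom, then left-to-right.
        ordered = sorted(
            pages[page_number],
            key=lambda b: (
                b["Geometry"]["BoundingBox"]["Top"],
                b["Geometry"]["BoundingBox"]["Left"],
            ),
        )
        sections.append(f"## Page {page_number}")
        sections.append("\n".join(b["Text"] for b in ordered))
    return "\n\n".join(sections)
```

Handling columns, footnotes, and formulas would mean grouping lines by horizontal position before sorting, so treat this as a starting point only.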

Answered By OCRenthusiast88 On

Have you checked out Textract or Mistral OCR? Their features might just suit the variety of formats you're tackling!

Answered By TechGuru99 On

Have you considered using Amazon Textract from the start? It could simplify your process significantly compared to handling everything manually.

InquisitiveMind22 -

Yeah, I was thinking the same. Why not start with Textract?

Answered By VisualTechie11 On

Have you thought about using Rekognition as an alternative? It might have some features you're looking for!
