Best AWS Solution for Running PDF OCR Workloads: EC2, EKS, or Lambda?

June 24, 2025

Asked By CuriousCaper77 On June 24, 2025

I've been experimenting with setting up OCR workflows on AWS for processing academic PDFs. Some of these PDFs are scanned, and many have complicated layouts including multi-column formats, footnotes, and formulas. My goal is to convert these PDFs into clean Markdown for further use. I've started testing locally with Tesseract (using Docker) and have also tried OCRFlux, which handles cross-page tables and multilingual content well.

Here's what I've tested so far:
1. **EC2 (g4dn/x86 instance)** - This has been pretty straightforward and I can run OCRFlux just fine. With Docker and CUDA support, it's manageable cost-wise for batch jobs a few times a week, though it does feel wasteful to keep an instance running all the time.

2. **Lambda (with Tesseract via layers)** - I attempted to use a lightweight build of Tesseract on Lambda. It works okay for single-page PDFs but struggles with larger documents because of Lambda's memory and timeout limits (at most 10 GB of memory and 15 minutes per invocation). Plus, Lambda has no GPU support, so performance isn't ideal.

3. **EKS with GPU nodes** - This setup was the most complex but allowed great scalability. I containerized OCRFlux, created a controller for document management, and used Kubernetes Jobs to kick off processing. Although this works efficiently for larger batches, costs can increase depending on GPU requirements.
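For readers who want a picture of the Kubernetes Jobs approach in item 3, here is a minimal sketch of a Job manifest that requests one GPU. The job name, image name, and `--input` argument are hypothetical placeholders, not the setup described above; the only real conventions used are the `batch/v1` Job schema and the `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin.

```python
import json


def build_ocr_job(job_name, image, s3_uri, gpus=1):
    """Build a batch/v1 Job manifest (as a dict) for one OCR batch.

    The image and the "--input" argument are assumptions for illustration;
    substitute whatever your own container expects.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            # Retry a failed pod at most twice before marking the Job failed.
            "backoffLimit": 2,
            # Let the cluster clean up finished Jobs after an hour.
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ocr",
                            "image": image,
                            "args": ["--input", s3_uri],
                            # Schedules the pod onto a node with a free GPU.
                            "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_ocr_job("ocr-batch-001", "example/ocrflux:latest", "s3://my-bucket/in/")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since kubectl accepts JSON as well as YAML.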

I'm still figuring things out and would like to know:
- What's the best balance of cost and orchestration for relatively small volumes (around 500 to 1000 PDFs a month)?
- Has anyone tried using Batch or Fargate for such workloads? Lambda seems limited while EC2 feels too manual for a queued job setup.
- Additionally, has anyone considered using managed services like Textract for OCR (possibly with Comprehend for analyzing the extracted text), even though they might struggle with layout fidelity?
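On the Batch question, a rough sketch of what a queued setup could look like: one AWS Batch job per PDF, submitted with boto3. The queue name, job definition name, and the `PDF_KEY` environment variable are assumptions for illustration; the container behind the job definition would have to read that variable itself.

```python
import re


def build_batch_submission(pdf_key, job_queue, job_definition):
    """Build keyword arguments for boto3's Batch submit_job call.

    Batch job names allow only letters, digits, hyphens and underscores
    (max 128 characters), so the S3 key is sanitized first.
    """
    safe_name = ("ocr-" + re.sub(r"[^A-Za-z0-9_-]", "-", pdf_key))[:128]
    return {
        "jobName": safe_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # PDF_KEY is a hypothetical variable the OCR container would read.
            "environment": [{"name": "PDF_KEY", "value": pdf_key}],
        },
    }


def submit_pdf(pdf_key, job_queue, job_definition):
    """Submit one OCR job and return its Batch job ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    batch = boto3.client("batch")
    response = batch.submit_job(
        **build_batch_submission(pdf_key, job_queue, job_definition)
    )
    return response["jobId"]
```

With a GPU-backed compute environment whose minimum vCPU count is zero, Batch can scale instances down when the queue is empty, which addresses the "instance running all the time" concern for a few hundred PDFs a month.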

I'd love to hear from anyone who's tackled similar document parsing or OCR workloads on AWS, especially if you're managing costs while handling GPU-intensive parsing. Also curious about any experiences with OCRFlux or other modern parsers and how you're deploying them in the cloud.

5 Answers

Answered By OCRninja On June 28, 2025

What about Amazon Rekognition? It could be worth considering for your needs as well!

Answered By MarkdownMaven On June 27, 2025

You could look at open-source projects that run without a GPU. Using something like Textractor might help, but remember that you’ll still need to write a script to transform its output into Markdown, and that can get tricky!
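To give a sense of what such a conversion script involves, here is a minimal sketch that works directly on a raw Textract response rather than through Textractor. It assumes the standard `Blocks` structure (`BlockType`, `Text`, `Page`, `Geometry.BoundingBox`) and uses a deliberately naive reading order, so multi-column pages will come out interleaved; the `## Page N` headings are just an illustrative choice.

```python
from collections import defaultdict


def textract_lines_to_markdown(blocks):
    """Turn Textract LINE blocks into rough Markdown, one section per page."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, TABLE and other block types
        box = block["Geometry"]["BoundingBox"]
        # "Page" can be missing in single-page responses, so default to 1.
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))

    sections = []
    for page_number in sorted(pages):
        # Naive reading order: top to bottom, then left to right.
        lines = [text for _, _, text in sorted(pages[page_number])]
        sections.append(f"## Page {page_number}\n\n" + "\n".join(lines))
    return "\n\n".join(sections)
```

The tricky part mentioned above starts where this sketch stops: detecting columns, merging lines into paragraphs, and mapping tables, footnotes, and formulas to Markdown all need extra logic.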

Answered By CloudExplorer21 On June 25, 2025

If a GPU isn't essential, check out trigger.dev. They offer machines with 16 GB RAM and queue support, which could work well for single-document OCR without needing a GPU. Each document runs as a long-running task, so make sure you manage memory carefully.

Answered By DocumentDynamo On June 24, 2025

Have you tried Textract or Mistral OCR? I know those options vary in their layout handling, but they might offer you some flexibility depending on your files' formats.

Answered By TechieGuru99 On June 24, 2025

Have you considered using Amazon Textract from the start? It could potentially save you some hassle with OCR processing, especially given your use case with PDF layouts.

QuestionAuthor77 - June 27, 2025

That's a good point! I'm wondering though if Textract can handle some of the more complicated layouts I'm working with.
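One way to check this on a few sample files is to run Textract's asynchronous document analysis with the `LAYOUT` feature enabled and look at which layout elements it reports. The sketch below is only an assumption about how such a test could be set up; the bucket and key are placeholders, and the PDF must already be in S3.

```python
from collections import Counter


def build_analysis_request(bucket, key):
    """Keyword arguments for Textract's start_document_analysis call."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # LAYOUT adds LAYOUT_* blocks (titles, headers, lists, tables, ...).
        "FeatureTypes": ["TABLES", "LAYOUT"],
    }


def start_analysis(bucket, key):
    """Start an async analysis job and return its JobId (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers work without boto3

    textract = boto3.client("textract")
    return textract.start_document_analysis(**build_analysis_request(bucket, key))["JobId"]


def layout_block_counts(blocks):
    """Count the LAYOUT_* blocks in a list of Textract blocks."""
    return Counter(
        block["BlockType"]
        for block in blocks
        if block.get("BlockType", "").startswith("LAYOUT_")
    )
```

Results are fetched with `get_document_analysis(JobId=...)`, following `NextToken` until all pages are returned. Comparing the counts against what is actually on the page (columns, footnotes, section headers) gives a quick read on whether the layout output is usable for a given set of papers.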
