
Best AWS Options for Running PDF OCR Workflows: EC2, EKS, or Lambda?

June 27, 2025

Asked By TechieExplorer42 On June 27, 2025

I'm experimenting with OCR workflows for academic PDFs on AWS and would love to learn from others' experiences. Some of these PDFs are scanned, while others have complex layouts with multiple columns, footnotes, and occasional formulas. My goal is to transform them into clean Markdown for easier processing. I've been testing locally with Tesseract in Docker and recently explored OCRFlux due to its ability to handle cross-page tables and multilingual data.

Here's what I've tried so far:

1. **EC2 (g4dn/x86 instance)**: This setup was pretty straightforward and worked fine with OCRFlux. The cost is manageable for running batch jobs a few times a week since I can spin the instance down afterward. However, I feel it's a bit wasteful to keep an instance running for a task that has sporadic demand.

2. **Lambda (with Tesseract)**: I attempted to use a lightweight version of Tesseract in Lambda through custom layers. It worked okay for single-page PDFs, but the memory and timeout limits are frustrating when handling larger documents or heavy post-processing. Plus, without GPU support, performance isn't great.

3. **EKS with GPU nodes**: This was the trickiest option to set up but proved to be the most scalable. I containerized OCRFlux and developed a small controller to manage document intake and output to S3. Kicking off jobs via Kubernetes Jobs works well for batching dozens of PDFs, but costs can skyrocket depending on how many nodes and GPU resources I maintain.
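For readers curious what the Kubernetes Jobs approach in item 3 might look like, here is a minimal sketch. It is not the poster's actual controller. The function name, the `S3_BUCKET` and `PDF_KEYS` environment variables, and the container image are hypothetical, and it assumes the worker container reads its list of PDFs from the environment and that the cluster has the NVIDIA device plugin installed so `nvidia.com/gpu` can be requested.

```python
import json
import re


def build_ocr_job(batch_id: str, pdf_keys: list[str], image: str, bucket: str) -> dict:
    """Return a batch/v1 Job manifest (as a dict) that processes one batch of PDFs."""
    # Job names must be lowercase DNS labels, so replace anything else with "-".
    safe_id = re.sub(r"[^a-z0-9-]", "-", batch_id.lower()).strip("-")
    name = f"ocr-{safe_id}"[:63].rstrip("-")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # let Kubernetes clean up finished Jobs
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ocr",
                            "image": image,  # hypothetical OCR worker image
                            "env": [
                                # Hypothetical contract with the worker container.
                                {"name": "S3_BUCKET", "value": bucket},
                                {"name": "PDF_KEYS", "value": json.dumps(pdf_keys)},
                            ],
                            # One GPU per pod; the node pool can scale to zero between runs.
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_ocr_job(
        "2025-06-27-a", ["inbox/paper1.pdf", "inbox/paper2.pdf"],
        image="example.com/ocr-worker:latest", bucket="my-ocr-bucket",
    )
    # kubectl accepts JSON, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Keeping one batch per Job means the GPU node group only needs to exist while Jobs are pending, which is one way to limit the cost problem mentioned above.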

I'm still figuring out a few things:
- For smaller volumes (like 500-1000 PDFs a month), what's the best trade-off between cost and orchestration ease?
- Has anyone used Batch or Fargate for similar workloads? Lambda seems restricted, while EC2 feels too "manual" for a job queue.
- I'm also curious if anyone has offloaded the OCR process to services like Textract or Comprehend, though I'm skeptical about their layout fidelity for my needs.

If anyone has dealt with similar OCR workloads on AWS, I'd love to hear how you've approached it, particularly if you've balanced GPU-intensive processing with cost effectiveness. Also, if you've tested OCRFlux or similar modern parsers, please share how you set them up in the cloud!

5 Answers

Answered By ImageProcessorHero On June 27, 2025

Have you looked into Amazon Rekognition? Its text detection feature might cover some of your needs.

Answered By CloudGuru75 On June 27, 2025

Have you considered using Amazon Textract from the get-go? It's tailored for document processing and could save you setup time on OCR workflows.

CuriousUser98 - June 27, 2025

I second that! Textract might simplify things for you.

Answered By DisklessDreamer11 On June 27, 2025

If a GPU isn't absolutely necessary, you might want to check out trigger.dev. They offer 16 GB RAM machines with queue support, which could work well for single-document OCR jobs that don't need a GPU.

Answered By MarkdownMaster On June 27, 2025

You could try an open-source project that doesn’t require a GPU and run it on Lambda. Textract can extract information, but converting that to Markdown requires additional scripting, which can be tricky. Using a library that already handles the transformation would streamline the process.
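To give a sense of the "additional scripting" involved, here is a minimal sketch that turns a Textract response into Markdown. It assumes the response came from `AnalyzeDocument` with the `LAYOUT` feature enabled, so it contains `LAYOUT_*` blocks whose `CHILD` relationships point at `LINE` blocks. The function name and the heading mapping are my own choices, and lists, tables, figures, and formulas are deliberately left out.

```python
from typing import Any

# Markdown prefix per Textract layout block type. Types not listed here
# (page headers, footers, page numbers, tables, figures, ...) are skipped.
MARKDOWN_PREFIX = {
    "LAYOUT_TITLE": "# ",
    "LAYOUT_SECTION_HEADER": "## ",
    "LAYOUT_TEXT": "",
}


def textract_layout_to_markdown(response: dict[str, Any]) -> str:
    """Convert the layout blocks of a Textract response into simple Markdown."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    parts = []
    for block in blocks:
        prefix = MARKDOWN_PREFIX.get(block.get("BlockType"))
        if prefix is None:
            continue  # not a layout type we render
        # Collect the text of the LINE blocks that belong to this layout block.
        lines = []
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "CHILD":
                continue
            for child_id in rel.get("Ids", []):
                child = by_id.get(child_id)
                if child and child.get("BlockType") == "LINE":
                    lines.append(child["Text"])
        if lines:
            parts.append(prefix + " ".join(lines))
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    # Tiny hand-written response in the same shape Textract returns.
    sample = {
        "Blocks": [
            {"Id": "l1", "BlockType": "LINE", "Text": "A Study of Things"},
            {"Id": "l2", "BlockType": "LINE", "Text": "We study things"},
            {"Id": "l3", "BlockType": "LINE", "Text": "in two columns."},
            {"Id": "t", "BlockType": "LAYOUT_TITLE",
             "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
            {"Id": "p", "BlockType": "LAYOUT_TEXT",
             "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
        ]
    }
    print(textract_layout_to_markdown(sample))
```

Even this small version shows where the trickiness comes from: reading order, multi-column text, footnotes, and formulas all need their own handling, which is why a library that already does the conversion can be attractive.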

Answered By DocuAIHelper On June 27, 2025

Have you given Textract or Mistral OCR a shot? How well they work might depend on the types of layouts and formats you're dealing with, but they're worth considering.
