I'm experimenting with OCR workflows on AWS to handle academic PDFs that may be scanned or have complex layouts (multi-column text, footnotes). My goal is to convert these PDFs into clean Markdown for further processing. So far I've been testing locally with Tesseract in Docker, and I recently tried OCRFlux since it supports cross-page tables and multiple languages.
Here's what I've tried so far:
1. **EC2**: A g4dn (x86) instance was straightforward for running OCRFlux with CUDA support. It's cost-effective if I only run batch jobs a few times a week and stop the instance in between, but keeping an instance up around the clock for bursty workloads feels wasteful.
2. **Lambda**: I squeezed a lightweight Tesseract build into Lambda. It worked for single-page PDFs but struggled with larger files due to the memory ceiling and the 15-minute execution timeout. The lack of GPU support also hurts performance here.
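One common workaround for the Lambda limits in option 2 is to fan a large PDF out into single-page (or few-page) chunks and process each chunk in its own invocation. A minimal sketch of just the chunking logic, in pure Python (`chunk_pages` and `pages_per_chunk` are names I made up, not from any library):

```python
def chunk_pages(page_count: int, pages_per_chunk: int = 1):
    """Split a document of `page_count` pages into (start, end) ranges,
    inclusive and 1-indexed, so each range fits one Lambda invocation."""
    if page_count < 1 or pages_per_chunk < 1:
        raise ValueError("page_count and pages_per_chunk must be >= 1")
    return [
        (start, min(start + pages_per_chunk - 1, page_count))
        for start in range(1, page_count + 1, pages_per_chunk)
    ]

# Each (start, end) tuple would become one queue message / Lambda event.
print(chunk_pages(5, 2))  # → [(1, 2), (3, 4), (5, 5)]
```

Each chunk then stays well under the timeout, and the results are stitched back together in page order afterwards.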
3. **EKS**: The most complex option but the most scalable: I containerized OCRFlux and used Kubernetes Jobs for document processing. It works well for larger batches, though GPU node costs add up quickly.
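For anyone curious what the EKS setup in option 3 looks like, here's a rough sketch of the kind of Job manifest I mean (the image name, parallelism, and completion count are placeholders for illustration; each pod pulls its document from a queue such as SQS):

```yaml
# Sketch of a GPU batch Job; resource sizes and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: ocrflux-batch
spec:
  parallelism: 4            # process four PDFs at once
  completions: 100          # one pod per queued document
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ocrflux
          image: my-registry/ocrflux:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1               # requires the NVIDIA device plugin
```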
Now I'm trying to find the best balance between cost and ease of orchestration for processing 500-1000 PDFs a month. Has anyone used AWS Batch or Fargate for this? Lambda feels too limited for the workload, while plain EC2 seems too manual. I'd also like to hear from anyone who has done OCR with Textract (or post-processed the output with Comprehend) and how well it handled layout fidelity.
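For sizing Batch versus an always-on instance at that volume, a back-of-the-envelope calculation helps. A sketch, assuming roughly 2 minutes of GPU time per PDF and an illustrative $0.50/hour rate (both numbers are my guesses; check actual throughput and current pricing):

```python
def monthly_gpu_cost(pdfs_per_month: int,
                     minutes_per_pdf: float = 2.0,
                     hourly_rate: float = 0.50) -> float:
    """Estimated monthly cost in dollars if compute only runs while
    jobs run (the AWS Batch / on-demand model)."""
    hours = pdfs_per_month * minutes_per_pdf / 60
    return hours * hourly_rate

# 1000 PDFs * 2 min ≈ 33.3 GPU-hours of actual work,
# versus 720 hours billed for the same instance left running 24/7.
print(round(monthly_gpu_cost(1000), 2))  # → 16.67
```

Under these assumptions the batch model is an order of magnitude cheaper than a persistent instance, which is why Batch with managed spot capacity is usually suggested for bursty workloads like this.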
2 Answers
Why didn’t you consider using Amazon Textract from the start? It could handle the OCR tasks without much hassle, especially for structured data.
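For reference, the synchronous Textract call returns a list of `Blocks`, and flattening the `LINE` blocks gives you plain text to convert to Markdown. Note that `detect_document_text` takes images or single-page documents; multipage PDFs need the async `start_document_text_detection` path via S3. A sketch (the commented `boto3` call is the real API; the flattening helper and sample response are mine):

```python
# Real call (needs AWS credentials):
#   import boto3
#   resp = boto3.client("textract").detect_document_text(
#       Document={"Bytes": open("page.png", "rb").read()})

def lines_from_blocks(blocks):
    """Pull the text of each LINE block, in order, from a Textract response."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]

# Hand-made sample shaped like a Textract response:
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Section 1. Introduction"},
    {"BlockType": "WORD", "Text": "Section"},
    {"BlockType": "LINE", "Text": "Academic PDFs vary widely."},
]}
print("\n".join(lines_from_blocks(sample["Blocks"])))
```

Be aware that Textract's layout handling of multi-column academic pages is something you'd want to test on your own documents before committing.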
If you're not using GPUs, check out Trigger.dev. They offer machines with 16 GB RAM and built-in queue support, which works well for single-document OCR without a GPU. Since tasks run as long-lived processes, you can also manage memory explicitly in your code.
Yeah, I was wondering the same thing. Textract might save you a lot of development time!