Best Way to Run PDF OCR Workloads on AWS: EC2, EKS, or Lambda?

June 24, 2025

Asked By TechieTwilight87 On June 24, 2025

Hey everyone! I'm starting to set up OCR workflows on AWS for academic PDFs, which include a mix of scanned pages and documents with messy layouts (think multi-column, footnotes, and occasional formulas). My aim is to convert these into clean Markdown for further processing. I've been testing locally with Tesseract in Docker and recently switched to OCRFlux for its ability to handle complex layouts and multilingual text.

Here's what I've tried so far:
1. **EC2** (g4dn/x86 instance): This is pretty straightforward and allows me to run OCRFlux using Docker with CUDA support. It's manageable cost-wise for batch jobs a few times a week, but it seems wasteful to have an instance running for bursty tasks.
2. **Lambda** (with layers + Tesseract): I tried a lightweight build of Tesseract in Lambda, which works fine for single-page PDFs and basic forms, but the memory cap (10 GB) and the 15-minute timeout are a hassle for larger documents. Plus, there's no GPU support, so performance is limited.
3. **EKS with GPU nodes**: Setting this up was the most complex, but it's also the most scalable. I containerized OCRFlux and created a controller for document intake and output to S3, using k8s Jobs for processing. While great for batching multiple PDFs, costs can escalate depending on GPU allocation.
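
For the EKS option, a one-Job-per-PDF setup might look roughly like the sketch below. It only builds the Job manifest as a Python dict; the image name, the `INPUT_URI`/`OUTPUT_URI` environment variables, and the `markdown/` output prefix are hypothetical placeholders, not anything OCRFlux defines. The `nvidia.com/gpu` limit assumes the NVIDIA device plugin is installed on the GPU nodes.

```python
import hashlib
import json


def build_ocr_job(bucket: str, key: str, image: str, output_prefix: str = "markdown/") -> dict:
    """Return a batch/v1 Job manifest that processes one PDF on one GPU."""
    # Job names must be valid DNS labels, so derive one from a hash of the S3 path.
    digest = hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).hexdigest()[:12]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"ocr-{digest}"},
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # let the cluster clean up finished Jobs
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ocr",
                        "image": image,  # hypothetical OCR container image
                        "env": [
                            # Hypothetical variables the container would read.
                            {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
                            {"name": "OUTPUT_URI", "value": f"s3://{bucket}/{output_prefix}{key}.md"},
                        ],
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    # JSON is accepted by `kubectl apply -f -`, so the dict can be piped straight in.
    print(json.dumps(build_ocr_job("my-bucket", "papers/a.pdf", "example/ocr:latest"), indent=2))
```

Capping each Job at one GPU and letting the node group scale on pending pods is one way to keep GPU allocation proportional to the queue rather than to the cluster size.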

I'm still trying to figure out:
- For a volume of around 500-1000 PDFs a month, what's the best way to balance cost with ease of orchestration?
- Has anyone here used Batch or Fargate for similar workloads? It feels like Lambda is pretty limited while EC2 is too manual for a queued workflow.
- Also, has anyone considered using Textract or Comprehend for OCR? They seem to fall short for the kind of layout fidelity I'm looking for.
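
On the Batch question, a queued workflow could be sketched as one `submit_job` call per uploaded PDF. The snippet below is a minimal sketch under assumptions: the job queue and job definition already exist, the definition points at a GPU-capable container, and `INPUT_URI` is a hypothetical variable that the container reads. A Batch compute environment can be configured with a minimum of zero vCPUs, so nothing runs while the queue is empty.

```python
import re


def build_batch_request(bucket: str, key: str, job_queue: str, job_definition: str) -> dict:
    """Build the keyword arguments for the AWS Batch SubmitJob call for one PDF."""
    # Batch job names allow letters, digits, hyphens and underscores (max 128 chars).
    safe_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:120]
    return {
        "jobName": f"ocr-{safe_name}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                # Hypothetical variable name; the container decides what it reads.
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
            ],
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }


def submit(request: dict) -> str:
    """Send the request to AWS Batch and return the job ID."""
    import boto3  # imported lazily so the builder works without AWS credentials

    response = boto3.client("batch").submit_job(**request)
    return response["jobId"]
```

Calling `submit(build_batch_request(...))` from an S3 event handler (for example a small Lambda) would give the queue-driven intake without a long-running controller.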

I'd love to hear your experiences if you've tackled similar document parsing or OCR workflows on AWS, especially regarding how to manage GPU-heavy tasks while keeping costs down. Also, if anyone has tried OCRFlux or other modern parsers in the cloud, I'd be keen to know how you deployed them!
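
To put rough numbers on the cost side, the comparison between an always-on instance and paying only for busy time can be sketched as below. Every input is an assumption to replace with real figures: the hourly rate is a placeholder rather than an actual g4dn price, and the per-PDF processing time and startup overhead need to be measured.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_cost(hourly_rate: float) -> float:
    """Monthly cost of leaving one instance running the whole time."""
    return hourly_rate * HOURS_PER_MONTH


def on_demand_cost(hourly_rate: float, pdfs_per_month: int, minutes_per_pdf: float,
                   batches_per_month: int = 0, startup_minutes_per_batch: float = 0.0) -> float:
    """Monthly cost when the instance only runs while there is work to do."""
    busy_minutes = pdfs_per_month * minutes_per_pdf
    # Each batch also pays for boot, image pull and model load before useful work starts.
    busy_minutes += batches_per_month * startup_minutes_per_batch
    return hourly_rate * busy_minutes / 60


if __name__ == "__main__":
    rate = 1.00  # placeholder USD/hour, NOT a real price
    print(f"always on: {always_on_cost(rate):.2f}")
    print(f"on demand: {on_demand_cost(rate, 1000, 2.0, 12, 5.0):.2f}")
```

At this volume the busy hours are a small fraction of the month under almost any plausible per-PDF time, which is why scale-to-zero options tend to win on cost.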

5 Answers

Answered By CloudNinja123 On June 27, 2025

Hey, why didn’t you consider using Amazon Textract right off the bat? It might handle your needs pretty well given its capabilities with structured documents.

DataDiver99 - June 27, 2025

Yeah, I was wondering the same thing!

Answered By OpenSourceSavant On June 27, 2025

You could look for an open-source project that doesn’t require a GPU and run it on Lambda. Textractor can get the information out, but the conversion to Markdown will need a script, and trust me, that’s not a simple task. An open-source library that handles this transformation would be a better option.
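
To give a sense of why that script is not simple, here is a minimal sketch of the easiest case only: turning Textract `LINE` blocks into Markdown paragraphs for a single-column page. It assumes the response shape Textract returns for text detection (`Blocks` with `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`); the `gap_factor` threshold is an arbitrary assumption. Multi-column layouts, footnotes, headings, and formulas would each need extra logic.

```python
def blocks_to_markdown(blocks: list, gap_factor: float = 1.5) -> str:
    """Join Textract LINE blocks into paragraphs, splitting on large vertical gaps."""
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    # Naive reading order: page, then top-to-bottom, then left-to-right.
    lines.sort(key=lambda b: (b.get("Page", 1),
                              b["Geometry"]["BoundingBox"]["Top"],
                              b["Geometry"]["BoundingBox"]["Left"]))
    paragraphs, current, prev = [], [], None
    for block in lines:
        box = block["Geometry"]["BoundingBox"]
        page = block.get("Page", 1)
        if prev is not None:
            prev_page, prev_box = prev
            gap = box["Top"] - (prev_box["Top"] + prev_box["Height"])
            # A new page or a gap much larger than a line height starts a new paragraph.
            if page != prev_page or gap > gap_factor * prev_box["Height"]:
                paragraphs.append(" ".join(current))
                current = []
        current.append(block["Text"])
        prev = (page, box)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

The top-then-left sort is exactly what breaks on two-column papers, since it interleaves lines from both columns; that is the part a dedicated layout-aware library handles.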

Answered By VisionaryCaptain On June 26, 2025

What about using Rekognition? It might be another tool worth considering for your OCR needs.

Answered By GPUGuru42 On June 25, 2025

If you don’t necessarily need GPU support, you should check out trigger.dev. They offer machines with 16 GB of RAM and queue support, which could work well for your single document OCR tasks without needing any heavy GPU setup.

Answered By ReaderTech77 On June 25, 2025

Have you given Textract or Mistral OCR a try? I’m not sure how varied your layouts are, but those might be useful options to explore.
