
Best AWS Options for Running PDF OCR Workloads: EC2, EKS, Lambda, or Textract?

June 27, 2025

Asked By CuriousCoder42 On June 27, 2025

I'm exploring how to set up OCR workflows on AWS for processing academic PDFs. These PDFs vary in quality, with some being scanned and others having difficult layouts like multi-columns and footnotes crammed with text. My ultimate goal is to convert these documents into clean Markdown for easier downstream processing. I started off doing some local testing with Tesseract in a Docker container, and recently, I experimented with OCRFlux, which is good at handling complex content like cross-page tables and multilingual text.

Here's a quick rundown of what I've tried so far:
1. **EC2 (g4dn/x86 instance)**: This was easy to set up and worked well with OCRFlux using CUDA for performance. It's a cost-effective option for running batch jobs a few times weekly, but having a constantly running instance seems inefficient for my sporadic workload.

2. **Lambda (using layers with Tesseract)**: I attempted to load a lighter version of Tesseract into Lambda. It's decent for processing single-page PDFs but runs into memory limits and timeouts on larger documents. Lambda also has no GPU support, which limits performance.

3. **EKS with GPU nodes**: Setup for this was quite intricate, but it proved to be the most scalable solution. I managed to containerize OCRFlux, set up a small controller for document processing, and pushed outputs to S3 with k8s Jobs. While effective, it can become costly as I need to keep more nodes and GPUs running for larger batches.
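
One way to picture the trade-off between the options above is a small router that sends short documents to Lambda and everything else to a GPU backend. This is only a sketch: `PdfJob`, `pick_backend`, the per-page time estimate, and the safety factor are hypothetical and would need to be measured on real documents. The one fixed number is Lambda's 900-second cap on a single invocation.

```python
from dataclasses import dataclass

# Lambda caps a single invocation at 15 minutes.
LAMBDA_MAX_TIMEOUT_S = 900


@dataclass
class PdfJob:
    """Hypothetical description of one document waiting for OCR."""
    key: str    # e.g. an S3 object key
    pages: int  # page count, known up front or read from the PDF


def pick_backend(job: PdfJob, secs_per_page: float = 20.0, safety: float = 0.5) -> str:
    """Return "lambda" if the estimated OCR time fits comfortably in one
    invocation, otherwise "gpu-batch".

    secs_per_page and safety are assumptions, not measured values.
    """
    budget = LAMBDA_MAX_TIMEOUT_S * safety
    estimated = job.pages * secs_per_page
    return "lambda" if estimated <= budget else "gpu-batch"


if __name__ == "__main__":
    for job in (PdfJob("short-note.pdf", 1), PdfJob("thesis.pdf", 200)):
        print(job.key, "->", pick_backend(job))
```

A router like this only covers the timeout side; memory pressure from large scanned pages would need its own check.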

I'm trying to figure out the best balance of cost and ease of orchestration for about 500 to 1000 PDFs each month. Have any of you tried AWS Batch or Fargate for workloads like this? Lambda seems too limited, while EC2 feels overly manual for my needs. Has anyone used Amazon Textract or Comprehend for this? I doubt they'd meet my layout requirements. I'd appreciate any insights from those with experience running similar document parsing and OCR workloads on AWS, especially on balancing GPU-heavy processing with cost efficiency.
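
For reference, here is roughly what the AWS Batch route could look like as an array job, where each child job handles a slice of the month's PDFs. This is a sketch under assumptions: the queue name `ocr-gpu-queue`, the job definition `ocrflux-worker`, the `MANIFEST_URI` and `PDFS_PER_CHILD` variables, and the idea of a manifest file listing the PDFs are all hypothetical. The request shape follows the `submit_job` call in boto3, and each child reads its own index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that Batch sets.

```python
import math


def build_array_job_request(total_pdfs, pdfs_per_child, manifest_uri,
                            job_queue="ocr-gpu-queue",
                            job_definition="ocrflux-worker"):
    """Build keyword arguments for a Batch submit_job call.

    Queue, job definition, and environment variable names are placeholders.
    """
    size = math.ceil(total_pdfs / pdfs_per_child)
    if size < 2:
        # Batch array jobs need at least 2 children; submit a plain job instead.
        raise ValueError("array jobs need at least 2 children")
    return {
        "jobName": "ocr-monthly-batch",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": size},
        "containerOverrides": {
            "environment": [
                {"name": "MANIFEST_URI", "value": manifest_uri},
                {"name": "PDFS_PER_CHILD", "value": str(pdfs_per_child)},
            ],
            # Request one GPU per child job.
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }


def child_slice(array_index, pdfs_per_child, total_pdfs):
    """Half-open range of manifest entries for one child job.

    Inside the container, array_index would come from
    int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"]).
    """
    start = array_index * pdfs_per_child
    return start, min(start + pdfs_per_child, total_pdfs)


if __name__ == "__main__":
    request = build_array_job_request(1000, 50, "s3://example-bucket/manifest.txt")
    print(request["arrayProperties"])
    # To actually submit (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("batch").submit_job(**request)
```

If the compute environment behind the queue is allowed to scale down to zero vCPUs when idle, GPU instances should only be billed while a batch is running, which may suit a sporadic workload better than a constantly running instance.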

5 Answers

Answered By ScriptSlinger14 On June 27, 2025

You could try an open-source project that runs on Lambda without a GPU, like Textractor. It captures the information, but keep in mind that converting everything to Markdown takes custom scripting, which can get really tricky.

Answered By TechieTurtle88 On June 27, 2025

Why didn't you consider using Amazon Textract from the start? It might save you some hassle compared to the other methods you've chosen.

HelpfulHand921 - June 27, 2025

Yeah, I'm curious about that too!

Answered By DataDabbler77 On June 27, 2025

Have you looked into alternatives like Textract or Mistral OCR? I'm not sure how varied your document layouts are, but they could be worth investigating.

Answered By VisionaryVision13 On June 27, 2025

Have you considered Amazon Rekognition? It might be another tool to bring into the mix for your OCR needs.

Answered By CloudGuru33 On June 27, 2025

If GPU isn't strictly necessary for your tasks, you might want to explore trigger.dev. They offer machines with 16 GB of RAM and built-in queue support, which works well for single-document OCR without needing GPU.
