
Best AWS Solutions for Running OCR on PDFs: EC2, EKS, or Lambda?

June 24, 2025

Asked By CuriousCoder42 On June 24, 2025

I'm currently trying to set up OCR workflows for academic PDFs on AWS and wanted to get some insights from others. The PDFs I'm dealing with vary quite a bit—some are scanned, while others have complicated layouts with multiple columns, footnotes, and formulas. My objective is to convert these PDFs into tidy Markdown for further processing. I've started experimenting locally using Tesseract with Docker and have also checked out OCRFlux, which seems better at handling cross-page tables and multilingual text.

Here's a breakdown of my experience so far:
1. **EC2** (g4dn/x86 instance): This setup is straightforward and works well for running OCRFlux with CUDA support. It's cost-effective for batch jobs a couple of times a week, but keeping an instance running between bursts feels inefficient.

2. **Lambda** (with Tesseract layers): I tried a lightweight Tesseract build in Lambda. It works for single-page PDFs but runs into memory limits and timeouts on larger files, and with no GPU support it is slow.

3. **EKS with GPU nodes**: This option was the most complex to set up, but it scales well. I containerized OCRFlux and wrote a controller that picks up incoming documents, runs them as Kubernetes Jobs, and writes the output to S3. This works great for processing several PDFs at once, but GPU costs add up.
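One way to combine options 2 and 3 is to route each document by size, so short files take the CPU/Lambda path and everything else goes to the GPU queue. The following is a minimal sketch, not a tested setup: `PdfInfo`, `choose_backend`, `secs_per_page`, and `max_mb` are hypothetical names and thresholds you would tune from your own timing runs. Only the 900-second figure is an actual Lambda limit.

```python
from dataclasses import dataclass

# Hard AWS Lambda timeout ceiling (15 minutes).
LAMBDA_MAX_TIMEOUT_S = 900


@dataclass
class PdfInfo:
    key: str        # S3 object key
    pages: int      # page count, read from the PDF beforehand
    size_mb: float  # object size in MB


def choose_backend(pdf: PdfInfo, secs_per_page: float = 10.0,
                   max_mb: float = 50.0) -> str:
    """Return "lambda" if the job should fit in Lambda, else "gpu-queue".

    secs_per_page and max_mb are assumed values, not measurements.
    """
    estimated_s = pdf.pages * secs_per_page
    # Keep a 2x safety margin under the Lambda timeout.
    if estimated_s * 2 <= LAMBDA_MAX_TIMEOUT_S and pdf.size_mb <= max_mb:
        return "lambda"
    return "gpu-queue"
```

With these assumed defaults, a 1-page scan is routed to Lambda and a 200-page thesis to the GPU queue.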

I'm still looking for the best balance between cost and ease of orchestration at smaller volumes, say 500 to 1000 PDFs monthly. Has anyone had luck with AWS Batch or Fargate for this kind of task? Lambda feels restrictive, and EC2 is a bit too hands-on for a job that should just sit in a queue until it runs. Additionally, have folks considered offloading OCR to tools like Textract or Comprehend, despite concerns about layout fidelity? Any experiences with OCRFlux or other modern parsing solutions in the cloud would be super helpful!
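For the queue-style workload described above, an AWS Batch submission per PDF could look roughly like this. It is a sketch under assumptions: the queue name `ocr-gpu-queue`, the job definition `ocrflux-job`, and the `INPUT_BUCKET`/`INPUT_KEY` environment variables are made up for illustration and would have to match your own Batch setup and container. `submit_job` and its `jobName`, `jobQueue`, `jobDefinition`, and `containerOverrides` parameters are part of the boto3 Batch client.

```python
def build_ocr_job(bucket: str, key: str, queue: str = "ocr-gpu-queue",
                  definition: str = "ocrflux-job") -> dict:
    """Build keyword arguments for the boto3 Batch submit_job call."""
    # Batch job names allow only letters, digits, hyphens, and underscores.
    safe = "".join(
        c if (c.isascii() and c.isalnum()) or c in "-_" else "-" for c in key
    )[:100]
    return {
        "jobName": f"ocr-{safe}",
        "jobQueue": queue,            # hypothetical queue name
        "jobDefinition": definition,  # hypothetical job definition name
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ]
        },
    }


def submit(bucket: str, key: str) -> str:
    """Submit one PDF to Batch and return the job ID."""
    import boto3  # imported lazily so build_ocr_job works without AWS access
    response = boto3.client("batch").submit_job(**build_ocr_job(bucket, key))
    return response["jobId"]
```

Keeping the request builder separate from the network call makes it easy to check the payload locally before anything is billed. Note that Fargate does not offer GPUs, so it would only suit the CPU/Tesseract path; GPU jobs need an EC2-backed compute environment.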

5 Answers

Answered By OCRExpert99 On June 26, 2025

Have you tested Textract or Mistral OCR? Depending on your PDF layouts, they might be useful alternatives.
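If you want to try Textract on a multi-page PDF stored in S3, the asynchronous text-detection API is the usual entry point. A rough sketch, assuming the bucket and key are yours and that plain line text is enough for a first comparison: `lines_by_page` and `fetch_blocks` are illustrative helper names, while `start_document_text_detection` and `get_document_text_detection` are boto3 Textract client calls. Plain text detection does not model layout, so reading order on multi-column pages may need extra work.

```python
from collections import defaultdict


def lines_by_page(blocks: list) -> dict:
    """Group Textract LINE blocks into {page_number: text}."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages[block.get("Page", 1)].append(block["Text"])
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


def fetch_blocks(bucket: str, key: str, poll_seconds: int = 5) -> list:
    """Run async text detection on an S3 PDF and return all blocks."""
    import time
    import boto3  # imported lazily so lines_by_page works without AWS access

    client = boto3.client("textract")
    job_id = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )["JobId"]

    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_text_detection(**kwargs)
        if page["JobStatus"] == "IN_PROGRESS":
            time.sleep(poll_seconds)  # simple polling; SNS is the alternative
            continue
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Textract job ended as {page['JobStatus']}")
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

Usage would be along the lines of `lines_by_page(fetch_blocks("my-bucket", "paper.pdf"))`, with both names standing in for your own.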

Answered By CloudWizard88 On June 26, 2025

If GPU isn’t essential for your needs, consider using trigger.dev. They offer 16 GB RAM instances with queue capabilities, so it might fit better for single-doc OCR without the GPU hassle.

Answered By MarkdownMaestro On June 24, 2025

You might want an open-source solution that doesn’t need a GPU, so you can run it on Lambda. Textractor can extract the data, but converting the result to Markdown is tricky and not straightforward. An existing library for this would be a huge help!

Answered By VisionaryAI On June 24, 2025

Why not explore Rekognition? It could cover some OCR needs as well.

Answered By TechieTim123 On June 24, 2025

Why didn't you start with Amazon Textract? It might simplify your setup right from the get-go!

PDFPalAce - June 27, 2025

I was thinking the same! Textract could save a lot of headaches with documents.
