
Best AWS Approach for Running PDF OCR Workloads: EC2, EKS, or Lambda?

June 27, 2025

Asked By PixelPioneer42 On June 27, 2025

I'm experimenting with OCR workflows for academic PDFs on AWS and am curious about what others are doing. Some of these PDFs are scanned and have complicated layouts with multi-column text and footnotes, which makes extracting content a challenge. My goal is to convert the PDFs into clean Markdown for easier processing downstream. I've tested a few solutions so far.

1. **EC2 (g4dn/x86 instance)**: This was easy to set up and runs OCRFlux smoothly with CUDA support. The cost is manageable for batch jobs a few times a week, but it feels wasteful to keep an instance running when tasks are bursty.

2. **Lambda (with Tesseract)**: I tried a lightweight build of Tesseract in Lambda, but it only works well for single-page PDFs or simple forms. The memory and timeout limits get tricky with bigger docs, and there's no GPU support, which hurts performance.

3. **EKS with GPU nodes**: This was complicated to set up, but it scales. I containerized OCRFlux and set up a controller for document management. It works well for batches, but costs can climb depending on node and GPU usage.
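One way to combine the options above is to route each document by size and layout: small, simple PDFs go to Lambda and everything else goes to a GPU-backed queue. Here is a minimal sketch of that idea. The per-page timing and the safety factor are placeholder assumptions to replace with your own measurements, and `PdfJob` and `choose_backend` are hypothetical names. The only fixed number is Lambda's documented 900-second maximum timeout.

```python
from dataclasses import dataclass

LAMBDA_MAX_TIMEOUT_S = 900  # AWS Lambda hard limit (15 minutes)

# Assumptions for illustration only; measure these on your own documents.
ASSUMED_CPU_SECONDS_PER_PAGE = 6.0
SAFETY_FACTOR = 0.5  # plan to use at most half of the timeout budget


@dataclass(frozen=True)
class PdfJob:
    key: str              # object key of the PDF (hypothetical naming)
    pages: int            # page count, known before OCR starts
    complex_layout: bool  # multi-column text, footnotes, and similar


def choose_backend(job: PdfJob) -> str:
    """Return "lambda" for small, simple documents and "gpu" otherwise."""
    if job.complex_layout:
        # Layout-heavy documents go to the GPU OCR model regardless of size.
        return "gpu"
    estimated_seconds = job.pages * ASSUMED_CPU_SECONDS_PER_PAGE
    budget_seconds = LAMBDA_MAX_TIMEOUT_S * SAFETY_FACTOR
    return "lambda" if estimated_seconds <= budget_seconds else "gpu"
```

With these placeholder numbers, anything over 75 simple pages is sent to the GPU path, so the cutoff is only as good as the timing you measure.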

I'm trying to figure out the best balance between cost and orchestration for about 500-1000 PDFs per month. Has anyone used AWS Batch or Fargate for these workloads? Lambda feels limited, and EC2 seems too hands-on for queuing jobs. I'm also open to hearing about experiences with Textract or Comprehend, though they might not deliver the layout fidelity I need. Would love to hear how you've handled similar OCR tasks.
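For reference, here is roughly what the Batch route could look like: one array job where each child container processes one PDF from a manifest. This is a sketch under assumptions, not a tested setup. The queue name, job definition name, and `MANIFEST_URI` variable are hypothetical, and the job definition is assumed to point at a GPU-enabled OCRFlux image on an EC2 compute environment (as far as I know, Fargate does not offer GPUs). Batch array jobs accept a size from 2 to 10,000, and each child receives its index in `AWS_BATCH_JOB_ARRAY_INDEX`.

```python
from typing import Any


def build_batch_request(manifest_uri: str, num_docs: int,
                        job_queue: str = "ocr-gpu-queue",       # hypothetical
                        job_definition: str = "ocrflux-gpu"     # hypothetical
                        ) -> dict[str, Any]:
    """Build keyword arguments for the boto3 Batch client's submit_job call."""
    if not 1 <= num_docs <= 10_000:
        raise ValueError("num_docs must be between 1 and 10,000")
    request: dict[str, Any] = {
        "jobName": "pdf-ocr",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # The container reads the manifest and picks the line matching
            # its AWS_BATCH_JOB_ARRAY_INDEX (assumed container behavior).
            "environment": [{"name": "MANIFEST_URI", "value": manifest_uri}],
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }
    if num_docs >= 2:
        # Array jobs need at least 2 children; a single PDF is a plain job.
        request["arrayProperties"] = {"size": num_docs}
    return request


def submit(request: dict[str, Any]) -> str:
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3
    return boto3.client("batch").submit_job(**request)["jobId"]
```

If the compute environment is allowed to scale down to zero vCPUs, the GPU instances only exist while jobs are queued, which addresses the idle-instance concern from the EC2 setup.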

5 Answers

Answered By LayoutLover234 On June 27, 2025

Have you looked into Textract or Mistral OCR? Not sure about your layout requirements, but they might still be worth testing if you haven’t tried them yet.

Answered By MarkdownGuru On June 27, 2025

You could use an open-source project that runs on Lambda without needing a GPU. Textractor is good for extracting information, but be prepared for the transformation to Markdown, which can be quite complex! I recommend finding a library that already does this.
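To give a sense of what the Markdown transformation involves, here is a minimal sketch that walks Textract layout blocks and emits headings and paragraphs. It assumes the document was analyzed with the `LAYOUT` feature type, so the response contains `LAYOUT_*` blocks whose `CHILD` relationships point at `LINE` blocks, and it relies on Textract returning layout blocks in reading order. The function name `blocks_to_markdown` is hypothetical, and tables, figures, lists, and footnotes are not handled, which is where most of the real complexity lives.

```python
from typing import Any

# Layout block types rendered as Markdown headings.
HEADING_PREFIX = {"LAYOUT_TITLE": "# ", "LAYOUT_SECTION_HEADER": "## "}
# Page furniture that usually should not appear in the Markdown output.
SKIPPED_TYPES = {"LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"}


def blocks_to_markdown(blocks: list[dict[str, Any]]) -> str:
    """Convert Textract blocks (LAYOUT feature enabled) to simple Markdown."""
    by_id = {block["Id"]: block for block in blocks}
    parts: list[str] = []
    for block in blocks:
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_") or block_type in SKIPPED_TYPES:
            continue
        # Collect the text of the LINE children, in the order given.
        lines = [
            by_id[child_id]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
            if by_id.get(child_id, {}).get("BlockType") == "LINE"
        ]
        if not lines:
            continue  # e.g. figures, or containers without direct lines
        parts.append(HEADING_PREFIX.get(block_type, "") + " ".join(lines))
    return "\n\n".join(parts)
```

The `blocks` argument would be the concatenated `Blocks` lists from the paginated results of an asynchronous document analysis job. Whether the reading order holds up on multi-column scans is something to check on your own PDFs.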

Answered By VisionaryDeveloper On June 27, 2025

Have you considered Amazon Rekognition? It might provide you with another option for processing your documents.

Answered By DevOpsGiant On June 27, 2025

If GPUs aren't a must, I suggest checking out trigger.dev. They offer machines with queue support and enough RAM for document processing without needing a GPU. It could simplify things for single-document OCR tasks.

Answered By TechSavvyNerd On June 27, 2025

Why didn't you consider using Amazon Textract from the start? It's built for document analysis and could save you some setup time. Just curious if you have specific reasons for skipping it!

CuriousCoder88 - June 27, 2025

Yeah, I'm interested in that too! Would love to hear your thoughts.
