Best AWS Solution for Running PDF OCR Workloads: EC2, EKS, or Lambda?

Asked By CuriousCaper77 On

I've been experimenting with setting up OCR workflows on AWS for processing academic PDFs. Some of these PDFs are scanned, and many have complicated layouts including multi-column formats, footnotes, and formulas. My goal is to convert these PDFs into clean Markdown for further use. I've started testing locally with Tesseract (using Docker) and have also tried OCRFlux, which handles cross-page tables and multilingual content well.

Here's what I've tested so far:
1. **EC2 (g4dn/x86 instance)** - This has been pretty straightforward and I can run OCRFlux just fine. With Docker and CUDA support, it's manageable cost-wise for batch jobs a few times a week, though it does feel wasteful to keep an instance running all the time.

2. **Lambda (with Tesseract via layers)** - I attempted to use a lightweight version of Tesseract on Lambda. It works okay for single-page PDFs but struggles with larger documents due to memory and timeout limits. Plus, without GPU support, the performance isn't ideal.

3. **EKS with GPU nodes** - This setup was the most complex but allowed great scalability. I containerized OCRFlux, created a controller for document management, and used Kubernetes Jobs to kick off processing. Although this works efficiently for larger batches, costs can increase depending on GPU requirements.
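For anyone curious what the "one Kubernetes Job per document" part of option 3 might look like, here is a minimal sketch. It only builds the Job manifest as a plain dict, so it runs without a cluster. The image name, the `--input` flag, and the `build_ocr_job` helper are all hypothetical placeholders, not OCRFlux's actual interface, and the `nvidia.com/gpu` limit assumes the NVIDIA device plugin is installed on the GPU nodes.

```python
import json
import re


def build_ocr_job(pdf_s3_uri: str, image: str = "example.com/ocrflux-worker:latest") -> dict:
    """Return a batch/v1 Job manifest that processes a single PDF on one GPU.

    `image` and the container args are placeholders (assumptions), not a real image or CLI.
    """
    # Kubernetes object names must be lowercase alphanumerics or '-'.
    stem = pdf_s3_uri.rsplit("/", 1)[-1].rsplit(".", 1)[0].lower()
    safe = re.sub(r"[^a-z0-9-]+", "-", stem)[:50].strip("-") or "doc"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"ocr-{safe}"},
        "spec": {
            "backoffLimit": 2,                # retry a failed document at most twice
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ocr",
                            "image": image,
                            "args": ["--input", pdf_s3_uri],  # hypothetical flag
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(build_ocr_job("s3://my-bucket/incoming/Paper_01.pdf"), indent=2))
```

Keeping the manifest builder separate from whatever submits it makes it easy to test the naming and resource settings without touching the cluster.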

I'm still figuring things out and would like to know:
- What's the best balance of cost and orchestration for relatively small volumes (around 500 to 1000 PDFs a month)?
- Has anyone tried using AWS Batch or Fargate for such workloads? Lambda seems too limited, while EC2 feels too manual for a queued-job setup.
- Additionally, has anyone considered using services like Textract or Comprehend for OCR, even though they might struggle with layout fidelity?
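To make the AWS Batch question concrete, here is a rough sketch of what submitting one PDF as a Batch job could look like with boto3. The queue name, job definition name, and the `INPUT_PDF` environment variable are made up for illustration, and the job definition is assumed to already point at a containerized parser. As far as I know, Fargate does not offer GPUs, so a GPU parser would need an EC2-backed compute environment; setting its minimum vCPUs to 0 should let it scale down between batches.

```python
import re


def build_submit_args(
    pdf_s3_uri: str,
    job_queue: str = "ocr-gpu-queue",      # hypothetical queue name
    job_definition: str = "ocrflux-job",   # hypothetical job definition name
) -> dict:
    """Build keyword arguments for the Batch `submit_job` call for one PDF."""
    stem = pdf_s3_uri.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    # Batch job names allow letters, numbers, hyphens, and underscores (max 128 chars).
    job_name = ("ocr-" + re.sub(r"[^A-Za-z0-9_-]", "-", stem))[:128]
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # INPUT_PDF is an assumed variable that the container would read.
            "environment": [{"name": "INPUT_PDF", "value": pdf_s3_uri}],
            "resourceRequirements": [{"type": "GPU", "value": "1"}],
        },
    }


def submit(pdf_s3_uri: str) -> str:
    """Submit the job and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    response = boto3.client("batch").submit_job(**build_submit_args(pdf_s3_uri))
    return response["jobId"]
```

At 500 to 1000 PDFs a month, a loop (or an S3 event handler) calling `submit` once per uploaded file would probably be enough orchestration, with Batch handling the queueing and instance lifecycle.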

I'd love to hear from anyone who's tackled similar document parsing or OCR workloads on AWS, especially if you're managing costs while handling GPU-intensive parsing. Also curious about any experiences with OCRFlux or other modern parsers and how you're deploying them in the cloud.

5 Answers

Answered By OCRninja On

What about Amazon Rekognition? It could be worth considering for your needs as well!

Answered By MarkdownMaven On

You could look at open-source projects that run without a GPU. Using something like Textractor might help, but remember that you’ll still need to develop a script to transform data into Markdown, and that can get tricky!
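As a rough idea of what that transformation script involves, here is a minimal sketch that turns a Textract-style response into Markdown. It assumes the response is the usual dict with a `Blocks` list, where `LINE` blocks carry `Text`, `Page`, and `Geometry.BoundingBox`. The `lines_to_markdown` helper is my own naming, and the plain top-to-bottom sort will interleave text on multi-column pages, which is exactly the tricky part.

```python
from collections import defaultdict


def lines_to_markdown(response: dict) -> str:
    """Convert LINE blocks from a Textract-style response into simple Markdown.

    Lines are grouped by page and ordered top-to-bottom, then left-to-right.
    This ignores columns, tables, footnotes, and formulas.
    """
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))

    parts = []
    for page in sorted(pages):
        parts.append(f"## Page {page}")
        parts.append("\n".join(text for _, _, text in sorted(pages[page])))
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = {
        "Blocks": [
            {"BlockType": "LINE", "Page": 1, "Text": "Abstract",
             "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.1}}},
            {"BlockType": "LINE", "Page": 1, "Text": "We study OCR pipelines.",
             "Geometry": {"BoundingBox": {"Top": 0.15, "Left": 0.1}}},
        ]
    }
    print(lines_to_markdown(sample))
```

Handling two-column academic papers would need an extra step, for example splitting lines by their `Left` coordinate before sorting, and headings or tables would need their own rules.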

Answered By CloudExplorer21 On

If GPU isn't essential, check out trigger.dev. They offer machines with 16 GB RAM and queue support, which could work well for single document OCR without needing a GPU. It's a long-running process, so ensure you manage memory properly.

Answered By DocumentDynamo On

Have you tried Textract or Mistral OCR? I know those options vary in their layout handling, but they might offer you some flexibility depending on your files' formats.

Answered By TechieGuru99 On

Have you considered using Amazon Textract from the start? It could potentially save you some hassle with OCR processing, especially given your use case with PDF layouts.

QuestionAuthor77 -

That's a good point! I'm wondering though if Textract can handle some of the more complicated layouts I'm working with.
