What’s the Best Way to Run PDF OCR Workloads on AWS?

Asked By CuriousCoder42 On

I'm trying to set up OCR workflows on AWS to convert academic PDFs into clean Markdown. These PDFs vary in quality—some are scanned with challenging layouts like multi-columns and footnotes, and occasionally include formulas. I've started locally with Tesseract through Docker and recently tested OCRFlux for its ability to handle complex documents.

Here's what I've explored:
1. **EC2 (g4dn/x86 instance)**: This was pretty straightforward. I got OCRFlux running with Docker and CUDA support. It's cost-effective for batch jobs a few times a week, but it feels wasteful to keep the instance running for sporadic tasks.
2. **Lambda (using Tesseract with custom layers)**: I found that while it works for single-page PDFs, it's pretty limiting with memory and time for larger documents, plus it lacks GPU support.
3. **EKS with GPU nodes**: Setting this up was the most complex, but it's scalable. I managed to batch process several PDFs well, but costs can ramp up with more nodes and GPU usage.

Now I'm looking for insights on:
- What offers the best cost-to-ease ratio for 500-1000 PDFs a month?
- Has anyone tried using Batch or Fargate for similar workloads?
- Should I also consider Textract or Comprehend for OCR? So far they don't look ideal for my layout needs.

Any advice on managing document parsing and OCR workloads efficiently on AWS would be appreciated, especially with balancing the cost of GPU parsing.
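To make the cost question concrete, here is a rough sketch that compares keeping a GPU instance running all month against starting it only for batch runs. Every number in it is a placeholder assumption (the hourly rate, the minutes per PDF, the startup overhead, and the number of runs), so substitute current pricing and throughput measured on your own documents.

```python
from dataclasses import dataclass


@dataclass
class GpuPlan:
    """Inputs for a back-of-the-envelope GPU cost estimate (all hypothetical)."""
    hourly_rate_usd: float            # assumed on-demand price; look up the real one
    pdfs_per_month: int               # e.g. 500-1000, per the question
    minutes_per_pdf: float            # assumed throughput; measure on real documents
    startup_minutes_per_run: float = 5.0   # assumed boot + model-load time per batch
    runs_per_month: int = 12               # "a few times a week"


def always_on_cost(plan: GpuPlan, hours_in_month: float = 730.0) -> float:
    """Cost of leaving the instance running for the whole month."""
    return plan.hourly_rate_usd * hours_in_month


def on_demand_cost(plan: GpuPlan) -> float:
    """Cost when the instance runs only while processing, plus startup overhead."""
    busy_minutes = plan.pdfs_per_month * plan.minutes_per_pdf
    overhead_minutes = plan.runs_per_month * plan.startup_minutes_per_run
    return plan.hourly_rate_usd * (busy_minutes + overhead_minutes) / 60.0


if __name__ == "__main__":
    plan = GpuPlan(hourly_rate_usd=0.5, pdfs_per_month=1000, minutes_per_pdf=1.0)
    print(f"always on: ${always_on_cost(plan):.2f}")
    print(f"on demand: ${on_demand_cost(plan):.2f}")
```

With numbers in this range, the gap between the two comes almost entirely from idle hours, which is why anything that starts a GPU only for the duration of a job (a stopped EC2 instance started on a schedule, or a Batch compute environment that scales to zero) is worth comparing against the always-on setup.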

6 Answers

Answered By CodeNinjaX On

If you don't need GPU power, take a look at trigger.dev. They offer machines with 16 GB of RAM and queue support, which are great for single document OCR without GPU costs.

Answered By PrintWizard99 On

Have you thought about trying Textract or even Mistral OCR? Just curious if they could handle your different document formats.

Answered By MarkdownMaster88 On

You could try an open-source project that runs on Lambda without needing a GPU. Textractor is good for information extraction, but converting its output to Markdown is complex. Finding a library that does this conversion well would save a lot of time.
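To give a sense of what the Markdown step involves, here is a minimal sketch that works on the raw `Blocks` list from a Textract response rather than on Textractor's objects. It assumes the response came from `analyze_document` with `FeatureTypes=["LAYOUT"]`, that layout blocks arrive in reading order, and that only titles, section headers, and plain text matter. The function name `layout_to_markdown` is made up for illustration; tables, figures, list bullets, and formulas are not handled.

```python
from typing import Any, Dict, List

# Layout block types mapped to a Markdown prefix; anything else becomes a paragraph.
PREFIXES = {"LAYOUT_TITLE": "# ", "LAYOUT_SECTION_HEADER": "## "}
# Running headers, footers and page numbers are dropped from the output.
SKIPPED = {"LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"}


def layout_to_markdown(blocks: List[Dict[str, Any]]) -> str:
    """Turn Textract LAYOUT_* blocks into simple Markdown (hypothetical helper)."""
    by_id = {block["Id"]: block for block in blocks}
    parts: List[str] = []
    for block in blocks:
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_") or block_type in SKIPPED:
            continue
        # Collect the text of the LINE blocks that are children of this layout block.
        lines = [
            by_id[child_id]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
            if by_id.get(child_id, {}).get("BlockType") == "LINE"
        ]
        if lines:
            parts.append(PREFIXES.get(block_type, "") + " ".join(lines))
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = [
        {"Id": "t", "BlockType": "LAYOUT_TITLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "p", "BlockType": "LAYOUT_TEXT",
         "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "A Study"},
        {"Id": "l2", "BlockType": "LINE", "Text": "First line"},
        {"Id": "l3", "BlockType": "LINE", "Text": "second line."},
    ]
    print(layout_to_markdown(sample))
```

Even this simple version shows where the complexity comes from: multi-column reading order, footnotes, and formulas all need extra handling beyond mapping block types to prefixes.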

Answered By TechyTom123 On

Why didn’t you consider using Amazon Textract from the beginning? It could handle the layouts you're working with more efficiently.

QuestionAsker -

That's a good point, I was just exploring a bunch of options to see which fit best.

Answered By ImageAnalyst77 On

Have you looked into using Rekognition? It might be worth checking out for your needs!
