Best AWS Solutions for Running OCR on PDFs: EC2, EKS, or Lambda?

Asked By CuriousCoder42 On

I'm currently trying to set up OCR workflows for academic PDFs on AWS and wanted to get some insights from others. The PDFs I'm dealing with vary quite a bit—some are scanned, while others have complicated layouts with multiple columns, footnotes, and formulas. My objective is to convert these PDFs into tidy Markdown for further processing. I've started experimenting locally using Tesseract with Docker and have also checked out OCRFlux, which seems better at handling cross-page tables and multilingual text.

Here's a breakdown of my experience so far:
1. **EC2** (g4dn/x86 instance): This setup is straightforward and works well for running OCRFlux with CUDA support. It's cost-effective for batch jobs a couple of times a week, but keeping an instance running between bursts just to catch occasional jobs feels wasteful.

2. **Lambda** (with Tesseract layers): I tried to implement a lightweight version of Tesseract in Lambda. It functions okay for single-page PDFs but struggles with memory limits and timeouts on larger files, plus there's no GPU support, which hurts performance.

3. **EKS with GPU nodes**: This option was the most complex to set up, but it scales well. I containerized OCRFlux and wrote a controller that handles document intake and writes output to S3, running each document as a Kubernetes Job. This works great for processing several PDFs at once, but GPU node costs add up.
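
For readers who want a picture of option 3, here is a minimal sketch of the kind of per-document Job manifest such a controller might generate. It is not the actual controller: the image name, the `--input`/`--output` arguments, and the `markdown/` output prefix are hypothetical placeholders for whatever the container's entrypoint expects, and the `nvidia.com/gpu` limit assumes the NVIDIA device plugin is installed on the GPU nodes.

```python
import hashlib


def build_ocr_job(bucket: str, key: str, image: str = "example.com/ocrflux:latest") -> dict:
    """Build a batch/v1 Job manifest that OCRs one PDF on a GPU node.

    The image name and container args are hypothetical placeholders.
    """
    # Job names must be valid DNS labels, so derive a short, stable
    # suffix from a hash of the S3 location instead of using the key.
    suffix = hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest()[:10]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"ocr-{suffix}"},
        "spec": {
            "backoffLimit": 2,                # retry a failed pod at most twice
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ocr",
                            "image": image,
                            # Hypothetical flags for the container entrypoint.
                            "args": [
                                "--input", f"s3://{bucket}/{key}",
                                "--output", f"s3://{bucket}/markdown/{key}.md",
                            ],
                            # Request one GPU so the pod lands on a GPU node.
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }
```

Because the name is derived from the S3 location, resubmitting the same document yields the same Job name, which gives a cheap guard against duplicate work. The dict can be serialized with `json.dumps` and passed to `kubectl apply -f -`, or handed to a Kubernetes client library.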

I'm still looking for the best balance between cost and ease of orchestration at smaller volumes, say 500 to 1000 PDFs monthly. Has anyone had luck with AWS Batch or Fargate for this kind of task? Lambda feels restrictive, and EC2 is too hands-on for a workload that is really just a queue of jobs. Also, have folks considered offloading OCR to services like Textract or Comprehend, despite concerns about layout fidelity? Any experiences with OCRFlux or other modern parsing solutions in the cloud would be super helpful!
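
To make the AWS Batch option concrete, here is a rough sketch of submitting one job per PDF with boto3. It assumes a job queue and job definition already exist; the names `ocr-gpu-queue` and `ocrflux-job`, the `INPUT_URI`/`OUTPUT_URI` environment variables, and the `markdown/` output prefix are all made up for illustration. A managed compute environment with a minimum of zero vCPUs can scale down when the queue is empty, which is the main appeal at this volume.

```python
import re


def build_submit_args(
    bucket: str,
    key: str,
    queue: str = "ocr-gpu-queue",         # hypothetical job queue name
    job_definition: str = "ocrflux-job",  # hypothetical job definition name
) -> dict:
    """Build keyword arguments for the Batch submit_job call for one PDF."""
    # Batch job names allow only letters, digits, hyphens and underscores,
    # up to 128 characters, so sanitize the S3 key and add a safe prefix.
    job_name = ("ocr-" + re.sub(r"[^A-Za-z0-9_-]", "-", key))[:128]
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # Hypothetical variables the container is assumed to read.
            "environment": [
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
                {"name": "OUTPUT_URI", "value": f"s3://{bucket}/markdown/{key}.md"},
            ]
        },
    }


def submit(bucket: str, key: str) -> str:
    """Submit the job and return its Batch job ID."""
    # Imported lazily so build_submit_args works without boto3 installed.
    import boto3

    response = boto3.client("batch").submit_job(**build_submit_args(bucket, key))
    return response["jobId"]
```

A small S3-triggered function could call `submit` for each uploaded PDF, so nothing runs while the queue is empty.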

5 Answers

Answered By OCRExpert99 On

Have you tested Textract or Mistral OCR? Depending on your PDF layouts, they might be useful alternatives.

Answered By CloudWizard88 On

If GPU isn’t essential for your needs, consider using trigger.dev. They offer 16 GB RAM instances with queue capabilities, so it might fit better for single-doc OCR without the GPU hassle.

Answered By MarkdownMaestro On

You might want an open-source solution that doesn't rely on a GPU, so you can run it on Lambda. Textractor can extract the data, but converting its output to Markdown is not straightforward. An existing library for that would be a huge help!
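
To show why the Markdown step is harder than it looks, here is a deliberately naive sketch that works on the raw Textract response format (a `Blocks` list whose `LINE` blocks carry `Text` and, for multi-page documents, `Page`). The function name is made up, and it only emits the detected lines under a heading per page. It does nothing about multi-column reading order, tables, footnotes, or formulas, which is exactly the tricky part.

```python
from collections import defaultdict


def textract_lines_to_markdown(response: dict) -> str:
    """Turn the LINE blocks of a Textract response into very plain Markdown."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        # Keep only LINE blocks; PAGE and WORD blocks are skipped.
        if block.get("BlockType") == "LINE":
            # Single-page synchronous responses may omit Page, so default to 1.
            pages[block.get("Page", 1)].append(block["Text"])

    sections = []
    for page in sorted(pages):
        # One heading per page, then the lines in the order Textract returned them.
        sections.append(f"## Page {page}\n\n" + "\n".join(pages[page]))
    return "\n\n".join(sections) + "\n"
```

Anything beyond this, such as rebuilding columns from the bounding boxes in each block's `Geometry`, has to be layered on top.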

Answered By VisionaryAI On

Why not explore Rekognition? It could cover some OCR needs as well.

Answered By TechieTim123 On

Why didn't you start with Amazon Textract? It might simplify your setup right from the get-go!

PDFPalAce -

I was thinking the same! Textract could save a lot of headaches with documents.
