Choosing the Best AWS Setup for PDF OCR Workflows

Asked By CuriousCat123 On

I'm setting up OCR workflows on AWS to convert academic PDFs (some scanned, others with tricky layouts like multi-column text and footnotes) into clean Markdown for downstream processing. I started by testing locally with Tesseract and also tried OCRFlux, which handles complex layouts better. Here's what I've tried so far:

1. **EC2 (g4dn/x86)** - Works well, good for batch processing, but it's a bit wasteful to keep an instance running for my bursty tasks.
2. **Lambda (Tesseract)** - A lightweight build works fine for simple PDFs but hits memory and timeout limits on larger documents, and without a GPU it is slow.
3. **EKS with GPU Nodes** - The toughest to set up but scalable; I had to containerize OCRFlux and handle job intake manually, which adds complexity but works for larger batches.

I'm wondering about the best mix of cost and orchestration for processing about 500-1000 PDFs a month. What's your experience with Batch or Fargate for these workloads? Also, have any of you used Textract or Comprehend for OCR, considering their limitations with layout fidelity? Would love to hear how others are managing similar projects, especially balancing heavy GPU needs with cost efficiency. Any insights would be greatly appreciated!
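For the Batch option raised above, here is a minimal sketch of how one job per PDF might be described, assuming the OCRFlux container already exists as a Batch job definition. The queue name, job definition name, and the `INPUT_BUCKET`/`INPUT_KEY` environment variables are hypothetical; only the keyword names (`jobName`, `jobQueue`, `jobDefinition`, `containerOverrides`) come from the Batch `submit_job` API.

```python
import re


def build_batch_job(bucket: str, key: str, job_queue: str, job_definition: str) -> dict:
    """Build keyword arguments for an AWS Batch submit_job call for one PDF.

    The environment variable names are assumptions: the container's
    entrypoint would need to read them to locate its input.
    """
    # Batch job names may contain letters, digits, hyphens, and underscores,
    # must start with a letter or digit, and are limited to 128 characters.
    stem = re.sub(r"[^A-Za-z0-9_-]", "-", key.rsplit("/", 1)[-1])
    return {
        "jobName": ("ocr-" + stem)[:128],
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ]
        },
    }
```

The returned dict could be passed as `boto3.client("batch").submit_job(**request)`. With a managed compute environment whose minimum vCPU count is 0, the GPU instances should scale down once the queue drains, which is what makes this shape attractive for bursty work.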

5 Answers

Answered By VisionaryView44 On

Why not also check out Rekognition? It might offer solutions that pair well with your OCR workflows.

Answered By CloudyCloud9 On

If you don’t need GPU power, consider using trigger.dev. They provide 16 GB RAM machines with queue support, which might fit your document processing needs without the overhead of managing GPUs.

Answered By OCRExplorer55 On

Have you given Mistral OCR a shot in addition to Textract? It really depends on your document formats, but it could be a solid alternative for complex layouts.

Answered By TechieTed98 On

Have you thought about using Amazon Textract from the start? It might solve some of your layout issues without heavy lifting. Just curious why it wasn't your first choice!

PlayfulPanda45 -

Yeah, I was wondering the same thing! Textract could simplify things.
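If you do try Textract, multi-page PDFs go through the asynchronous API. A rough sketch of starting a job with the layout feature and collecting the paginated results, assuming the PDF is already in S3 and `client` is a `boto3.client("textract")`. The function names and the polling interval are my own; a production setup would more likely use an SNS completion notification than polling.

```python
import time


def start_layout_job(client, bucket, key):
    """Start an asynchronous Textract analysis of a PDF stored in S3."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["LAYOUT", "TABLES"],
    )
    return response["JobId"]


def collect_blocks(client, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Wait for the job to finish, then return every block from all result pages."""
    while True:
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # Treats FAILED and PARTIAL_SUCCESS alike; adjust if partial output is acceptable.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        break

    blocks = list(page.get("Blocks", []))
    # Results are paginated: keep following NextToken until it is absent.
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks
```

Passing the client in as a parameter keeps the polling logic easy to exercise with a stub before pointing it at a real account.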

Answered By MarkdownMagician77 On

You could look into using an open-source solution that doesn't require a GPU on Lambda. Textractor can extract the data from Textract, but converting it to Markdown will take a fair bit of scripting, so be prepared for that!
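To give a sense of the scripting involved, here is a minimal sketch that turns raw Textract blocks (from a job run with the `LAYOUT` feature) into Markdown. It assumes the layout blocks arrive in reading order and only handles titles, section headers, and body text; tables, lists, figures, running headers, footers, and page numbers are skipped, so real academic PDFs would need more cases. The function name and the heading mapping are my own choices.

```python
# Which layout block types to render, and the Markdown prefix for each.
MARKDOWN_PREFIX = {
    "LAYOUT_TITLE": "# ",
    "LAYOUT_SECTION_HEADER": "## ",
    "LAYOUT_TEXT": "",
}


def blocks_to_markdown(blocks):
    """Render a list of Textract block dicts as a Markdown string."""
    by_id = {block["Id"]: block for block in blocks}
    paragraphs = []
    for block in blocks:
        prefix = MARKDOWN_PREFIX.get(block["BlockType"])
        if prefix is None:
            continue  # not a layout type this sketch renders
        # Gather the text of the LINE blocks this layout element points to.
        lines = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child is not None and child["BlockType"] == "LINE":
                    lines.append(child["Text"])
        if lines:
            paragraphs.append(prefix + " ".join(lines))
    return "\n\n".join(paragraphs)
```

Joining the lines of each layout element with spaces undoes the hard line breaks of the scanned page, which is usually what you want for Markdown paragraphs. Footnotes are the awkward part, since they may come back as ordinary text or footer blocks depending on the page.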
