I'm trying to set up OCR workflows on AWS to convert academic PDFs into clean Markdown. These PDFs vary in quality: some are scans with challenging layouts like multi-column text and footnotes, and some include formulas. I started locally with Tesseract in Docker and recently tested OCRFlux for its ability to handle complex documents.
Here's what I've explored:
1. **EC2 (g4dn/x86 instance)**: This was pretty straightforward. I got OCRFlux running with Docker and CUDA support. It's cost-effective for batch jobs a few times a week, but it feels wasteful to keep the instance running for sporadic tasks.
2. **Lambda (using Tesseract with custom layers)**: This works for single-page PDFs, but Lambda's caps of 10 GB of memory and 15 minutes per invocation are limiting for larger documents, and there is no GPU support.
3. **EKS with GPU nodes**: Setting this up was the most complex, but it's scalable. I managed to batch process several PDFs well, but costs can ramp up with more nodes and GPU usage.
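For the idle-instance concern in option 1, one possible approach is to start the GPU instance only for the duration of a batch and stop it afterwards. The following is a minimal sketch under stated assumptions, not a tested deployment: `running_instance` and `run_batch` are hypothetical helper names, `ec2` is assumed to be a boto3 EC2 client, and `process_pdfs` stands in for whatever actually triggers the OCRFlux job.

```python
from contextlib import contextmanager


@contextmanager
def running_instance(ec2, instance_id):
    """Start a stopped EC2 instance, yield, and always stop it afterwards.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    ec2.start_instances(InstanceIds=[instance_id])
    # Block until EC2 reports the instance as running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    try:
        yield instance_id
    finally:
        # Stop rather than terminate, so EBS-backed data persists.
        # The stop call runs even if the OCR job raises.
        ec2.stop_instances(InstanceIds=[instance_id])


def run_batch(ec2, instance_id, process_pdfs):
    """Run a caller-supplied OCR job (hypothetical callable) while the instance is up."""
    with running_instance(ec2, instance_id):
        return process_pdfs()
```

With boto3 installed, this would be called as `run_batch(boto3.client("ec2"), instance_id, job)`. A stopped instance no longer accrues compute charges, though its EBS volumes are still billed, and anything on local instance-store disks is lost on stop.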
Now I'm looking for insights on:
- What offers the best balance of cost and ease of setup for 500-1000 PDFs a month?
- Has anyone tried using Batch or Fargate for similar workloads?
- Also, should I consider Textract for OCR (possibly with Comprehend for downstream text analysis)? So far neither has looked ideal for my layout needs.
Any advice on managing document parsing and OCR workloads efficiently on AWS would be appreciated, especially with balancing the cost of GPU parsing.
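To make the cost question concrete, here is a rough back-of-the-envelope sketch comparing a GPU instance that runs only during batches with one left on all month. Every number in it is a hypothetical placeholder (hourly rate, seconds per PDF, startup overhead, runs per month), not an AWS price or a measured benchmark, and the function names are made up for illustration.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12 months


def on_demand_hours(pdfs_per_month, seconds_per_pdf,
                    startup_seconds=0, runs_per_month=1):
    """GPU hours used when the instance runs only during batches."""
    processing = pdfs_per_month * seconds_per_pdf
    startup = runs_per_month * startup_seconds  # boot + container pull per run
    return (processing + startup) / 3600


def monthly_cost(hours, hourly_rate):
    """Compute-only cost; storage and data transfer are ignored."""
    return hours * hourly_rate


# All values below are placeholders. Substitute your own measurements and
# the current price for your instance type and region.
rate = 0.50  # USD per GPU-instance hour (placeholder)
hours = on_demand_hours(pdfs_per_month=1000, seconds_per_pdf=60,
                        startup_seconds=300, runs_per_month=12)

print(f"batch-only: {hours:.1f} h, ${monthly_cost(hours, rate):.2f}")
print(f"always-on:  {HOURS_PER_MONTH} h, ${monthly_cost(HOURS_PER_MONTH, rate):.2f}")
```

With these placeholder inputs the batch-only case needs under 20 GPU hours against 730 for an always-on instance, so at this volume the deciding factor is likely how easily the instance can be started and stopped rather than the hourly rate itself.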
6 Answers
If you don't need GPU power, take a look at trigger.dev. They offer machines with 16 GB of RAM and queue support, which are great for single-document OCR without GPU costs.
Have you thought about trying Textract or even Mistral OCR? Just curious if they could handle your different document formats.
You could try an open-source project that runs on Lambda without needing a GPU. Textractor is good for information extraction, but converting its output to Markdown is complex. Finding a library that does this effectively would save a ton of time.
Why didn’t you consider using Amazon Textract from the beginning? It could handle the layouts you're working with more efficiently.
Have you looked into using Rekognition? It might be worth checking out for your needs!
That's a good point. I was just exploring a bunch of options to see which fit best.