Best AWS Setup for Running PDF OCR Workloads: EC2, EKS, or Lambda?

Asked By TechExplorer42

I'm experimenting with OCR workflows for academic PDFs on AWS and would love to hear about others' experiences. My PDFs vary: some are scanned, and others have complex layouts such as multiple columns or footnotes packed with text. My aim is to convert them into clean Markdown. So far, I've tried three setups: EC2 with OCRFlux in Docker (works well but feels wasteful for sporadic jobs), Lambda with Tesseract (too limited for larger documents), and EKS with GPU nodes (complicated but scalable). I'm trying to find the best balance between cost and ease of orchestration for about 500–1000 PDFs each month. Does anyone have experience with AWS Batch or Fargate for these workloads? Has anyone used Textract or similar services despite the layout concerns? I'd appreciate insights, especially on running GPU-heavy tasks while keeping costs low, and on experiences with OCRFlux or other cloud-based parsing solutions.
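To frame the "wasteful for sporadic jobs" concern, here is a rough back-of-envelope sketch comparing an always-on instance with paying only for busy time (as with Batch or other scale-to-zero setups). Every number in it is a placeholder assumption, not a real AWS price or a measured OCR speed, and it ignores startup overhead, storage, and data transfer.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def busy_hours(pdfs_per_month: int, pages_per_pdf: int, seconds_per_page: float) -> float:
    """Total compute hours actually spent on OCR in a month."""
    return pdfs_per_month * pages_per_pdf * seconds_per_page / 3600


def on_demand_cost(pdfs_per_month: int, pages_per_pdf: int,
                   seconds_per_page: float, hourly_rate: float) -> float:
    """Monthly cost when the machine is billed only while it is processing."""
    return busy_hours(pdfs_per_month, pages_per_pdf, seconds_per_page) * hourly_rate


def always_on_cost(hourly_rate: float) -> float:
    """Monthly cost when the machine runs around the clock."""
    return HOURS_PER_MONTH * hourly_rate


if __name__ == "__main__":
    # Placeholder inputs: 1000 PDFs, 20 pages each, 3 s per page, 1.00 per hour.
    rate = 1.00
    print(f"busy hours: {busy_hours(1000, 20, 3):.1f}")
    print(f"on-demand:  {on_demand_cost(1000, 20, 3, rate):.2f}")
    print(f"always-on:  {always_on_cost(rate):.2f}")
```

With inputs like these, the machine is busy for only a small fraction of the month, which is why scale-to-zero options tend to look attractive at this volume. Plug in your own measured throughput and current pricing before drawing conclusions.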

5 Answers

Answered By VisionaryTechie

Have you considered Amazon Rekognition? It could offer some useful capabilities for your OCR tasks!

Answered By OCR_Nerd2023

Have you explored Textract or Mistral OCR? Depending on the types of layouts you're dealing with, they might be worth looking into.

Answered By CritiqueMaven77

Have you thought about using Amazon Textract? It might simplify things for you right from the start, especially with its layout handling capabilities. Just curious why it wasn't your first choice!

TechExplorer42 -

I did consider Textract but wasn't sure it could handle some of the complex layouts I deal with.

Answered By PracticalCoder99

If a GPU isn't a necessity, I’d recommend trying trigger.dev. They have machines with 16 GB of RAM that support queuing. Single-document OCR hasn't required a GPU for us, and it’s been pretty efficient.

Answered By OpenSourceFan92

Using an open-source project that doesn't depend on a GPU could work too! You can go with Textractor, but remember it only extracts data, so you’d need to script the transformation to Markdown yourself, which can be a hassle with complex layouts. An open-source library that handles that conversion would save you some headaches!
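To give a feel for what "scripting the transformation yourself" might involve, here is a minimal sketch that turns a Textract-style response into Markdown. It assumes the response was produced with the layout feature enabled, so it contains `LAYOUT_*` blocks whose `CHILD` relationships point at `LINE` blocks, and that blocks arrive in reading order. The function names are hypothetical, and lists, tables, figures, and footnotes are deliberately left out, which is exactly where the real hassle starts.

```python
# Markdown prefix for each layout block type handled in this sketch.
# Anything not listed (page headers, footers, page numbers, tables, ...) is skipped.
PREFIX = {
    "LAYOUT_TITLE": "# ",
    "LAYOUT_SECTION_HEADER": "## ",
    "LAYOUT_TEXT": "",
}


def child_text(block: dict, by_id: dict) -> str:
    """Join the text of all LINE blocks referenced as children of `block`."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids", []):
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "LINE" and child.get("Text"):
                parts.append(child["Text"])
    return " ".join(parts)


def layout_to_markdown(response: dict) -> str:
    """Convert a Textract-style response dict into a simple Markdown string."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    chunks = []
    for block in blocks:  # assumes blocks are already in reading order
        block_type = block.get("BlockType", "")
        if block_type not in PREFIX:
            continue
        text = child_text(block, by_id)
        if text:
            chunks.append(PREFIX[block_type] + text)
    return "\n\n".join(chunks) + "\n"


# Hand-written sample shaped like a layout response (not real service output).
sample = {
    "Blocks": [
        {"Id": "l1", "BlockType": "LINE", "Text": "Deep OCR"},
        {"Id": "l2", "BlockType": "LINE", "Text": "First line"},
        {"Id": "l3", "BlockType": "LINE", "Text": "second line."},
        {"Id": "t", "BlockType": "LAYOUT_TITLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "p", "BlockType": "LAYOUT_TEXT",
         "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
        {"Id": "n", "BlockType": "LAYOUT_PAGE_NUMBER",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
    ]
}

if __name__ == "__main__":
    print(layout_to_markdown(sample))
```

In practice the input dict would come from something like boto3's Textract client (`analyze_document` with `FeatureTypes=["LAYOUT"]`), but multi-column pages and footnotes would still need to be checked against your own documents.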
