Best AWS Approach for Running PDF OCR Workloads: EC2, EKS, or Lambda?

Asked By PixelPioneer42 On

I'm experimenting with OCR workflows for academic PDFs on AWS and am curious about what others are doing. Some of these PDFs are scanned and have complicated layouts with multi-column text and footnotes, which makes extracting content a challenge. My goal is to convert the PDFs into clean Markdown for easier processing downstream. I've tested a few solutions so far.

1. **EC2 (g4dn/x86 instance)**: This was easy to set up and runs OCRFlux smoothly with CUDA support. The cost is manageable for batch jobs a few times a week, but it feels wasteful to keep an instance running when tasks are bursty.

2. **Lambda (with Tesseract)**: I tried a lightweight build of Tesseract in Lambda, but it only works well for single-page PDFs or simple forms. Lambda's memory cap (10 GB) and 15-minute timeout get tricky with bigger docs, and there is no GPU support, which hurts performance.

3. **EKS with GPU nodes**: The setup was complicated, but it scales. I containerized OCRFlux and set up a controller for document management. It works well for batches, but costs can climb depending on node and GPU usage.
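The three options above are not mutually exclusive: small, simple documents can stay on Lambda while everything else goes to a GPU queue. Here is a minimal routing sketch. The `PdfJob` fields, the per-page timing, and the safety factor are all hypothetical and would need to be measured on real documents; only the 900-second Lambda ceiling is a fixed limit.

```python
from dataclasses import dataclass

LAMBDA_TIMEOUT_S = 900  # Lambda's hard ceiling: 15 minutes per invocation


@dataclass
class PdfJob:
    key: str              # object key of the PDF (hypothetical field)
    pages: int            # page count, known before OCR starts
    complex_layout: bool  # multi-column, footnotes, etc. (assumed to be flagged upstream)


def choose_backend(job, cpu_seconds_per_page=6.0, safety_factor=0.5):
    """Return "lambda" or "gpu-batch" for a job.

    cpu_seconds_per_page is an assumed Tesseract-on-CPU timing, not a benchmark.
    safety_factor keeps the estimate well under the timeout to absorb variance.
    """
    if job.complex_layout:
        # Complex layouts go to the GPU model regardless of size.
        return "gpu-batch"
    estimated_s = job.pages * cpu_seconds_per_page
    if estimated_s <= LAMBDA_TIMEOUT_S * safety_factor:
        return "lambda"
    return "gpu-batch"


if __name__ == "__main__":
    for job in (PdfJob("form.pdf", 1, False), PdfJob("thesis.pdf", 200, False)):
        print(job.key, "->", choose_backend(job))
```

With the assumed defaults, anything over 75 simple pages is sent to the GPU queue, so the threshold is easy to retune once real timings are available.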

I'm trying to find the best balance between cost and orchestration effort for about 500-1000 PDFs per month. Has anyone used Batch or Fargate for these workloads? Lambda feels limited, and EC2 seems too hands-on for queuing jobs. I'm also open to hearing about experiences with Textract or Comprehend, though they might not deliver the layout fidelity I need. Would love to hear how you've handled similar OCR tasks.
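At this volume, the always-on versus on-demand question comes down mostly to arithmetic. The sketch below compares the two under stated assumptions: the hourly rate, seconds per document, and startup overhead are placeholders, not AWS prices or measured timings.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 * 24 / 12


def monthly_cost_always_on(hourly_rate):
    """Cost of leaving one instance running all month."""
    return hourly_rate * HOURS_PER_MONTH


def monthly_cost_on_demand(docs_per_month, seconds_per_doc, hourly_rate,
                           batches_per_month=1, startup_overhead_s=0.0):
    """Cost when compute only runs while documents are being processed.

    startup_overhead_s covers instance boot and model load, paid once per batch.
    """
    busy_s = docs_per_month * seconds_per_doc + batches_per_month * startup_overhead_s
    return hourly_rate * busy_s / 3600


if __name__ == "__main__":
    rate = 1.0  # placeholder rate in $/hour; substitute the real instance price
    print("always on:", monthly_cost_always_on(rate))
    print("on demand:", round(monthly_cost_on_demand(
        docs_per_month=1000, seconds_per_doc=60, hourly_rate=rate,
        batches_per_month=12, startup_overhead_s=300), 2))
```

If the on-demand figure comes out at a small fraction of the always-on one, as it does with these placeholder inputs, a scale-to-zero setup such as a Batch job queue is worth the extra orchestration.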

5 Answers

Answered By LayoutLover234 On

Have you looked into Textract or Mistral OCR? Not sure about your layout requirements, but they might still be worth testing if you haven’t tried them yet.
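For a quick Textract trial on multi-page PDFs, the asynchronous API reads the file from S3. Here is a minimal sketch, assuming the PDFs are already uploaded. The bucket and key names are hypothetical, and the choice of `LAYOUT` plus `TABLES` is only a starting point for layout-heavy papers.

```python
def build_analysis_request(bucket, key, features=("LAYOUT", "TABLES")):
    """Build keyword arguments for Textract's StartDocumentAnalysis call."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(features),
    }


def start_analysis(bucket, key):
    """Submit the job and return its JobId (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("textract")
    response = client.start_document_analysis(**build_analysis_request(bucket, key))
    return response["JobId"]


if __name__ == "__main__":
    # Hypothetical bucket and key, shown only to illustrate the request shape.
    print(build_analysis_request("my-ocr-bucket", "papers/sample.pdf"))
```

Results are then fetched with `get_document_analysis` using the returned job ID. Running a handful of the hardest multi-column scans this way should show quickly whether the layout output is good enough.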

Answered By MarkdownGuru On

You could use an open-source project that runs on Lambda without needing a GPU. Textractor is good for extracting content, but be prepared: converting the output to Markdown can be quite complex! I recommend finding a library that already does this.
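To give a sense of what the Markdown step involves, here is a minimal sketch that maps already-ordered layout blocks to Markdown. The `(kind, text)` tuples and the kind names are a hypothetical simplification, not any library's actual output format, and the hard part (recovering reading order across columns) is assumed to be done upstream.

```python
def blocks_to_markdown(blocks):
    """Convert (kind, text) tuples, given in reading order, to Markdown.

    Kinds used here are illustrative: title, section_header, list_item,
    footnote, text, plus page furniture that is dropped.
    """
    parts = []
    footnote_count = 0
    for kind, text in blocks:
        text = " ".join(text.split())  # collapse OCR line breaks and extra spaces
        if not text or kind in ("page_number", "header", "footer"):
            continue  # skip empty blocks and page furniture
        if kind == "title":
            parts.append(f"# {text}")
        elif kind == "section_header":
            parts.append(f"## {text}")
        elif kind == "list_item":
            parts.append(f"- {text}")
        elif kind == "footnote":
            footnote_count += 1
            parts.append(f"[^{footnote_count}]: {text}")
        else:
            parts.append(text)  # plain paragraph
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = [
        ("title", "A  Study"),
        ("page_number", "1"),
        ("text", "Body\ntext."),
        ("footnote", "See ref."),
    ]
    print(blocks_to_markdown(sample))
```

Tables, figures, and in-text footnote markers are left out on purpose. They are where most of the complexity lives, which is a good argument for using an existing converter.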

Answered By VisionaryDeveloper On

Have you considered Amazon Rekognition? It might provide you with another option for processing your documents.

Answered By DevOpsGiant On

If GPUs aren't a must, I suggest checking out trigger.dev. They offer machines with queue support and enough RAM for document processing without needing a GPU. It could simplify things for single-document OCR tasks.

Answered By TechSavvyNerd On

Why didn't you consider using Amazon Textract from the start? It's built for document analysis and could save you some setup time. Just curious if you have specific reasons for skipping it!

CuriousCoder88 -

Yeah, I'm interested in that too! Would love to hear your thoughts.
