I'm experimenting with setting up OCR workflows on AWS and would love to hear what others are doing! I'm primarily working with academic PDFs, which vary in quality—some are scanned, while others have poor layouts, such as multiple columns, footnotes, and occasional formulas. My ultimate goal is to convert these PDFs into clean Markdown for easier processing downstream.
So far, I've been testing locally with Tesseract (using Docker) and have also checked out OCRFlux, which is great for handling cross-page tables and multilingual content. Here's what I've tried:
1. **EC2 (g4dn/x86 instance)**: This has been pretty straightforward. I installed Docker with CUDA support and can run OCRFlux without issues. Financially, it's workable for batch jobs a few times per week since I can spin the instance down after use. However, I'm concerned about what it costs if the instance is left running between sporadic jobs.
2. **Lambda (using layers + Tesseract)**: I attempted to create a lightweight version of Tesseract using custom layers. It works fine for single-page PDFs or simple forms, but Lambda's caps on memory (10 GB) and execution time (15 minutes) can be frustrating for larger files or extensive post-processing. Plus, there's no GPU, so the performance isn't stellar.
3. **EKS with GPU nodes**: This was the toughest setup, but it's also the most scalable. I containerized OCRFlux and set up a small controller to handle document intake, pushing the results to S3. This option works great for batching several dozen PDFs, but I do worry about costs since they can add up with the number of nodes and GPU allocations.
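For the EC2 option, the "spin the instance down after use" step can be automated so it never depends on remembering to do it. Below is a minimal sketch, assuming the `ec2` argument behaves like `boto3.client("ec2")` and that the instance already has Docker and OCRFlux installed. The function name `running_instance` and the instance ID in the usage comment are made up for illustration.

```python
from contextlib import contextmanager


@contextmanager
def running_instance(ec2, instance_id):
    """Start an EC2 instance, yield while it runs, and always stop it afterwards.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    ec2.start_instances(InstanceIds=[instance_id])
    try:
        # Block until the instance reports the "running" state.
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        yield instance_id
    finally:
        # Stop (not terminate) so the EBS volume and installed tooling persist.
        ec2.stop_instances(InstanceIds=[instance_id])


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   with running_instance(boto3.client("ec2"), "i-0123456789abcdef0"):
#       ...  # trigger the OCR batch here, e.g. over SSH or SSM
```

Because the stop call sits in a `finally` block, a failed OCR run still shuts the GPU instance down, which is the part that matters for the bill.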
I'm still trying to figure out a few things:
- For smaller volumes of around 500-1000 PDFs a month, what's the best balance between cost and orchestration ease?
- Has anyone utilized Batch or Fargate for this workload? Lambda seems limited, and EC2 feels a bit too "manual" for an efficient queued job flow.
- Additionally, has anyone offloaded the OCR step to a managed service like Textract? (Comprehend analyzes text that has already been extracted, so it wouldn't do the OCR itself.) Textract doesn't seem to retain the layout fidelity I need.
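On the Batch question, a queued flow can be as small as one job per PDF. The sketch below only builds and submits the requests. It assumes `batch` behaves like `boto3.client("batch")`; the queue name `ocr-queue`, the job definition `ocrflux-gpu`, and the `INPUT_URI`/`OUTPUT_URI` environment variables are hypothetical and would have to match whatever the container actually reads. As far as I know, Fargate has no GPU support, so a GPU parser would need an EC2-backed compute environment.

```python
import re


def build_job_request(bucket, key, job_queue="ocr-queue", job_definition="ocrflux-gpu"):
    """Build keyword arguments for Batch's submit_job for one PDF stored in S3."""
    # Batch job names allow letters, digits, hyphens, and underscores (max 128 chars).
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:120]
    return {
        "jobName": f"ocr-{safe}",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [
                # Hypothetical variable names; the container must read these itself.
                {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"},
                {"name": "OUTPUT_URI", "value": f"s3://{bucket}/markdown/{key}.md"},
            ]
        },
    }


def submit_all(batch, bucket, keys):
    """Submit one job per key; `batch` is assumed to behave like boto3.client("batch")."""
    return [batch.submit_job(**build_job_request(bucket, k))["jobId"] for k in keys]
```

Keeping the request builder separate from the submit call makes it easy to check the payloads locally before anything is sent to AWS.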
If you've tackled similar OCR or document parsing tasks on AWS, I'd be super interested in your approaches, especially regarding balancing GPU usage with cost optimization. Also, has anyone experimented with OCRFlux or other modern parsers in a cloud environment?
4 Answers
Have you thought about using Amazon Textract from the start? It's designed specifically for OCR and layout analysis, so it could handle your academic PDFs well and simplify your pipeline.
If a GPU isn't a necessity, you might want to check out trigger.dev. They offer machines with 16 GB of RAM and queue support, which can be great for document processing without needing GPUs. For single-document OCR tasks, this setup works really well.
You can use open-source libraries that don't require a GPU and run them on Lambda. While Textract can extract information, you'd still need a script to convert its output to Markdown. It can be a bit tricky, so having a library that does the transformation could save you a lot of time.
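To give a rough idea of what such a conversion script involves, here is a minimal sketch that turns the `LINE` blocks of a Textract response into Markdown paragraphs. It assumes the usual response shape (`Blocks` entries with `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox`). The function name, the `gap_factor` heuristic, and the page-marker comments are illustrative choices, not part of any library.

```python
def textract_to_markdown(response, gap_factor=1.0):
    """Turn Textract LINE blocks into Markdown paragraphs, one section per page."""
    pages = {}
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            pages.setdefault(block.get("Page", 1), []).append(block)

    out = []
    for page in sorted(pages):
        # Naive reading order: top-to-bottom, then left-to-right.
        lines = sorted(
            pages[page],
            key=lambda b: (b["Geometry"]["BoundingBox"]["Top"],
                           b["Geometry"]["BoundingBox"]["Left"]),
        )
        paragraphs, current, prev_bottom = [], [], None
        for b in lines:
            box = b["Geometry"]["BoundingBox"]
            # A vertical gap larger than gap_factor * line height starts a new paragraph.
            if prev_bottom is not None and box["Top"] - prev_bottom > gap_factor * box["Height"]:
                paragraphs.append(" ".join(current))
                current = []
            current.append(b["Text"])
            prev_bottom = box["Top"] + box["Height"]
        if current:
            paragraphs.append(" ".join(current))
        out.append(f"<!-- page {page} -->\n\n" + "\n\n".join(paragraphs))
    return "\n\n".join(out)
```

This sketch ignores columns, footnotes, tables, and formulas entirely. On a two-column paper it would interleave the columns, which is exactly the "tricky" part and the reason a dedicated library may be the better choice.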
Have you tried using Textract or Mistral OCR? Those might offer better options depending on the layouts you're dealing with. It's always good to play around with different tools to find the right fit!