Has anyone used Textract for extracting images and tables from PDFs?

Asked By CloudyNinja47 On

I'm looking to extract images, tables, and figures from research papers. I've tried a few Python libraries like pymupdf and pdffigures2, but I've found them either too slow or too poor in extraction quality. For instance, pymupdf doesn't handle tables at all. I'm curious whether Textract or similar paid tools are worth considering for this task.

4 Answers

Answered By QuestionMasterX On

The best way to know if it's right for you is to give it a try. Textract is AWS's managed OCR service, and beyond plain text its AnalyzeDocument API can also return tables and form fields.

Answered By TechyGuru21 On

I created a repo specifically for extracting fields and tables from images using vision language models. You can check it out [here](https://github.com/NanoNets/docext), and it should help with your table extraction needs. Plus, you can run the whole setup in a Colab notebook linked in the repo!

CloudyNinja47 -

Thanks, I’ll check it out!

Answered By SkyHighCoder On

Textract definitely has content extraction capabilities. You can test it out in their web console; all you need is an AWS account. It’ll cost a few cents, but I think the demo features are free!
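If you'd rather try it from Python than from the console, here is a minimal sketch of pulling tables out of a Textract response. It assumes `boto3` is installed and AWS credentials are configured; the helper names (`tables_from_blocks`, `analyze_page`) and the default region are made up for illustration. As far as I know, the synchronous `analyze_document` call only takes a single page (an image or a one-page PDF), so multi-page papers would need the asynchronous API with the file in S3. Note this only covers tables, not images or figures.

```python
from collections import defaultdict


def tables_from_blocks(blocks):
    """Turn Textract blocks into a list of tables (each a list of rows of cell text)."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # A block lists its children under a relationship of type "CHILD".
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # A CELL's children are WORD blocks (and possibly selection elements).
        words = [
            by_id[i]["Text"]
            for i in child_ids(cell)
            if i in by_id and by_id[i]["BlockType"] == "WORD"
        ]
        return " ".join(words)

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        rows = defaultdict(dict)
        for cid in child_ids(block):
            cell = by_id.get(cid)
            if cell and cell["BlockType"] == "CELL":
                # RowIndex and ColumnIndex are 1-based positions in the table.
                rows[cell["RowIndex"]][cell["ColumnIndex"]] = cell_text(cell)
        tables.append([[rows[r][c] for c in sorted(rows[r])] for r in sorted(rows)])
    return tables


def analyze_page(path, region_name="us-east-1"):
    """Hypothetical helper: send one page to Textract and return its tables."""
    import boto3  # imported here so the parser above works without boto3

    client = boto3.client("textract", region_name=region_name)
    with open(path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
    return tables_from_blocks(response["Blocks"])
```

Calling `analyze_page("page1.png")` on a page you've exported from the paper should give you a list of tables you can compare against what pymupdf or pdffigures2 produced. Each call is billed, so test on a handful of pages first.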

Answered By PixelPioneer42 On

I’ve heard that Claude 3.5 Sonnet extracts information from images better than Textract does, so that might be worth looking into.
