I'm looking to extract images, tables, and figures from research papers. I've tried a few Python libraries like pymupdf and pdffigures2, but I've found them either too slow or their extraction quality is pretty poor. For instance, pymupdf doesn't handle tables at all. I'm curious if Textract or similar paid tools are worth considering for this task.
4 Answers
The best way to know if Textract is right for you is to give it a try. It's a managed OCR service with a solid set of features to work with.
I created a repo specifically for extracting fields and tables from images using vision language models. You can check it out [here](https://github.com/NanoNets/docext), and it should help with your table extraction needs. Plus, you can run the whole setup in a Colab notebook linked in the repo!
Textract can definitely extract tables and forms, not just raw text. You can test it out in the AWS web console; all you need is an AWS account. It’ll cost a few cents, but I think the demo features are free!
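If you'd rather script it than click through the console, here's a rough sketch of pulling tables out of a single page image with boto3. It assumes you have AWS credentials configured and that you've rendered the PDF page to a PNG or JPEG first. The helper names (`tables_from_textract`, `analyze_page`) and the default region are my own placeholders, not anything from the AWS docs, and I'm ignoring merged cells, so treat it as a starting point rather than a finished tool.

```python
from collections import defaultdict


def tables_from_textract(response):
    """Convert an AnalyzeDocument response dict into a list of tables.

    Each table is a list of rows, and each row is a list of cell strings.
    Merged cells are not handled specially in this sketch.
    """
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Textract links TABLE -> CELL -> WORD through "CHILD" relationships.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = defaultdict(dict)  # row index -> {column index: cell text}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[w]["Text"]
                for w in child_ids(cell)
                if blocks[w]["BlockType"] == "WORD"
            ]
            cells[cell["RowIndex"]][cell["ColumnIndex"]] = " ".join(words)
        n_cols = max((c for row in cells.values() for c in row), default=0)
        # Row and column indices from Textract are 1-based.
        tables.append(
            [[cells[r].get(c, "") for c in range(1, n_cols + 1)] for r in sorted(cells)]
        )
    return tables


def analyze_page(image_path, region_name="us-east-1"):
    """Send one page image to Textract and return its tables (hypothetical helper)."""
    import boto3  # imported here so the parser above works without boto3 installed

    client = boto3.client("textract", region_name=region_name)
    with open(image_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
    return tables_from_textract(response)
```

Calling `analyze_page("page_3.png")` (the file name is made up) should give you a list of tables you can dump to CSV or load into pandas. Each call is billed, so try it on a handful of pages before running it over a whole corpus.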
I’ve heard that Claude 3.5 Sonnet does better than Textract at extracting information from images, so that might be worth looking into.
Thanks, I’ll check it out!