I've been using Textract for document parsing and text extraction. It works well, but I'm looking for alternatives that handle table layouts better and can return the results as Markdown strings. I've heard good things about IBM's Docling and Meta's Nougat, but I'd really like to hear about people's real-world experience running different tools in production. Any suggestions? Also, thanks to a recommendation, I just found a fork called MarkItDown API that seems to fit my needs perfectly!
3 Answers
Another option to look into is Marker by Vik Paruchuri. It may have the functionality you're after.
I initially thought you were asking about AWS Textract, which also handles tables well, by the way. I've been using it for a few years now, and it does a good job across a variety of document types.
In my case, everything has to run on-premise, unfortunately.
You might want to check out Microsoft's MarkItDown. It seems to handle Markdown output quite well!
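If you want to try it quickly, here's a minimal sketch of getting a Markdown string out of a file and saving it. It assumes the `markitdown` package is installed (PDF support may need the optional extras, e.g. `pip install "markitdown[all]"`). The helper names `to_markdown` and `save_markdown` are just mine for illustration, not part of the library, and I haven't checked how well it copes with your particular table layouts.

```python
from pathlib import Path


def to_markdown(path: str) -> str:
    """Convert one document to a Markdown string with MarkItDown."""
    # Imported lazily so this module still loads where markitdown is not installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)
    # text_content holds the converted document as a Markdown string.
    return result.text_content


def save_markdown(markdown: str, source_path: str) -> Path:
    """Write the Markdown next to the source file, swapping the extension to .md."""
    out_path = Path(source_path).with_suffix(".md")
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

Usage would be something like `save_markdown(to_markdown("report.pdf"), "report.pdf")`, where `report.pdf` is a placeholder for your own file. Everything runs locally, so it's worth comparing its table output against Docling or Marker on a few of your real documents before committing.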
Awesome! Thank you! Are you using it right now?
Awesome, thanks!