Best Tools for Extracting Bracket Structures from PDFs?

Asked By CuriousExplorer42 On

I'm trying to extract tournament matches and scores from a PDF that has a complex bracket structure. The bracket spans multiple rounds, with a winner and a score for each match, and some slots are empty (for BYEs and similar cases). I've already tried pdfplumber, and I also tried converting the PDF to an image and reading it with Tesseract, but neither worked. Tesseract misreads text, especially Swedish characters, even when I add them to the whitelist. pdfplumber returns the text in an order that doesn't match the visual columns of the bracket. Is there a tool or method that can reliably pull this kind of data from a PDF?

1 Answer

Answered By TechSavvyGuru On

Have you looked into tools like Docling or Granite-Docling? They're often described as state-of-the-art for layout-aware document conversion, so they might handle that complexity better than what you've tried so far.
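If the PDF has a real text layer, the column problem can also be approached directly from word coordinates. The following is only a minimal sketch, not a tested solution for these brackets: it assumes each word is a dict with `text`, `x0`, and `top` keys (the shape pdfplumber's `page.extract_words()` returns), and that each round is left-aligned at a roughly constant x position. The function name `group_into_columns` and the `x_tolerance` value are hypothetical.

```python
def group_into_columns(words, x_tolerance=15.0):
    """Cluster word boxes into visual columns by their left edge (x0).

    Assumption: every round of the bracket is left-aligned, so words in
    the same round share nearly the same x0.
    """
    columns = []  # each entry: {"x0": anchor position, "words": [...]}
    for word in sorted(words, key=lambda w: w["x0"]):
        # Compare against the column's first x0 so the anchor cannot drift.
        if columns and word["x0"] - columns[-1]["x0"] <= x_tolerance:
            columns[-1]["words"].append(word)
        else:
            columns.append({"x0": word["x0"], "words": [word]})
    # Within a column, read top to bottom.
    return [sorted(c["words"], key=lambda w: w["top"]) for c in columns]


# Made-up sample data standing in for real extracted words.
sample = [
    {"text": "Öberg", "x0": 52.0, "top": 130.0},
    {"text": "Åkesson", "x0": 50.0, "top": 100.0},
    {"text": "6-3", "x0": 203.0, "top": 125.0},
    {"text": "Åkesson", "x0": 200.0, "top": 115.0},
]
for index, column in enumerate(group_into_columns(sample)):
    print(index, [w["text"] for w in column])
```

With pdfplumber, the input would come from `page.extract_words()`. Multi-word names come back as separate words, so they would need to be merged first (for example with `keep_blank_chars=True`), otherwise a surname could land in a column of its own. Because the text is read from the PDF's text layer rather than by OCR, Swedish characters should come through intact, provided the PDF is not a scan.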

CuriousExplorer42 -

Thanks for the suggestion! I did try Docling, but I'm worried about detecting empty player slots and scores. The example I have is just one of many formats, so I'm not sure it's possible to parse all variations. I was hoping for something AI-based that could adapt to this complexity.
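On the empty-slot concern, one possible heuristic is to look for gaps in the vertical spacing of a round's entries. This is a rough sketch under a strong assumption that will not hold for every format: the slots in a round are evenly spaced, and the spacing (`slot_height`) is known or can be estimated. The function name `find_empty_slots` and its parameters are hypothetical.

```python
def find_empty_slots(tops, slot_height, tolerance=0.25):
    """Return indices of slots that contain no text.

    tops: y-positions of the entries that were found in one round.
    slot_height: assumed constant vertical distance between slots.
    Limitation: an empty slot before the first or after the last
    detected entry cannot be seen by this approach.
    """
    if not tops:
        return []
    ordered = sorted(tops)
    first = ordered[0]
    filled = set()
    for top in ordered:
        offset = (top - first) / slot_height
        index = round(offset)
        # Ignore text that does not sit close to an expected slot position.
        if abs(offset - index) <= tolerance:
            filled.add(index)
    return [i for i in range(max(filled) + 1) if i not in filled]


# Made-up positions: entries at slots 0, 1, 3, 4; slot 2 is empty (a BYE).
print(find_empty_slots([100.0, 120.0, 160.0, 180.0], slot_height=20.0))
```

Each bracket format would likely need its own spacing estimate, so this does not remove the need to handle the layouts case by case.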
