Hey everyone! We're working with a client who has a large number of Excel and PDF files that come in various formats, and they change often. Some files have a standard tabular structure, others use pivot tables, and some are more complex or semi-structured.
Our goal is to automatically infer the structure of each file, extract the necessary information, and load it into normalized tables in Databricks. There are many different templates now, and new ones will likely appear. Given this variability, what do you recommend for the tech stack, pipeline, and architecture? Should we choose Document Intelligence or Content Understanding? Are these technologies reliable enough for correctly interpreting the file formats and extracting the required values?
4 Answers
We've been using Document Intelligence for about five years now, and it's been solid for us: we extract data from over 50 different document types with custom templates. Content Understanding is newer, and I’m still figuring out its specific advantages, though it can also handle audio and video, which is a plus for certain tasks.
Be prepared to put in some work upfront to build a classification model alongside an extraction model. My team's classification model sorts documents into over 80 categories with 96% accuracy. Once the document type is identified, you can take the key-value pairs from the initial read call and apply some custom logic to them. If you're aiming for a normalized tabular structure, you can potentially use a transformer to align your key-value pairs with the desired format. Just a heads up, we do this professionally for enterprise clients at about $230 an hour.
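To make the classify-then-extract-then-align flow a bit more concrete, here is a minimal Python sketch. It is not our production code and it makes no real Document Intelligence calls: `classify` and `extract` are hypothetical callables you would back with your own classification and extraction models, and the document types, field labels, and column names in `FIELD_MAPS` are made up for illustration. The alignment step here is a plain lookup table, not a learned model.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-template mappings: extracted field label -> normalized column.
# Supporting a new template means adding one entry here.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "invoice": {
        "Invoice No": "document_id",
        "Total Due": "amount",
        "Date": "document_date",
    },
    "purchase_order": {
        "PO Number": "document_id",
        "Order Total": "amount",
        "Order Date": "document_date",
    },
}


def normalize(doc_type: str, key_values: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Align raw key-value pairs with the normalized column names."""
    mapping = FIELD_MAPS.get(doc_type)
    if mapping is None:
        # Unknown template: fail loudly so the file can go to manual review.
        raise KeyError(f"no field mapping for document type {doc_type!r}")
    row: Dict[str, Optional[str]] = {"doc_type": doc_type}
    for label, column in mapping.items():
        # A missing field becomes None rather than silently dropping the column.
        row[column] = key_values.get(label)
    return row


def process(
    file_bytes: bytes,
    classify: Callable[[bytes], str],
    extract: Callable[[bytes], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Classify the file, extract its key-value pairs, then normalize them."""
    doc_type = classify(file_bytes)
    key_values = extract(file_bytes)
    return normalize(doc_type, key_values)
```

Every document type ends up with the same columns, so the resulting rows can be appended to a single normalized table, and unknown types raise instead of being loaded with the wrong mapping.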
I suggest trying both Document Intelligence and Content Understanding to see which one actually fits your needs better. Hands-on experience would provide clearer insights.
One approach might be to use an LLM to convert your documents into JSON. With today's technology, it’s pretty straightforward to use a cheaper LLM with structured output. Instead of relying solely on cloud services that can be costly and restrictive, consider some of the excellent open-source options out there.
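As a rough illustration of the structured-output idea, the Python sketch below asks a model for JSON and validates the reply before anything is loaded downstream. `call_llm` is a hypothetical callable standing in for whichever model or client you choose (it takes a prompt string and returns the raw reply string), and the three fields in `REQUIRED_FIELDS` are invented for the example.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical target schema: field name -> accepted Python type(s).
REQUIRED_FIELDS: Dict[str, Any] = {
    "document_id": str,
    "amount": (int, float),
    "currency": str,
}


def build_prompt(document_text: str) -> str:
    """Ask the model to answer with a single JSON object and nothing else."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Extract the following fields from the document and reply with "
        f"one JSON object only, using exactly these keys: {fields}.\n\n"
        f"Document:\n{document_text}"
    )


def parse_structured_output(raw: str) -> Dict[str, Any]:
    """Parse the model reply and check it against REQUIRED_FIELDS."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    # Keep only the schema fields; ignore anything extra the model added.
    return {name: data[name] for name in REQUIRED_FIELDS}


def extract(
    document_text: str,
    call_llm: Callable[[str], str],
    max_attempts: int = 2,
) -> Dict[str, Any]:
    """Call the model and retry when the reply fails validation."""
    prompt = build_prompt(document_text)
    last_error: Exception = ValueError("no attempts made")
    for _ in range(max_attempts):
        try:
            return parse_structured_output(call_llm(prompt))
        except ValueError as err:
            last_error = err
    raise ValueError("model did not return valid structured output") from last_error
```

The validation step is the important part: whatever model you pick, rejecting malformed or incomplete replies before they reach Databricks keeps bad rows out of the normalized tables.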
That’s an interesting take! I totally agree about exploring open-source tools. They can really help avoid vendor lock-in and unnecessary costs.

Content Understanding definitely has its benefits! I think the real strength lies in its ability to create custom schemas that automatically find the needed information.