I work at a small manufacturing company where all of our production floor documentation is stuck in Word .docx files. The bill of materials data in our current system doesn't always match what's in these files, and management doesn't quite grasp how these inconsistencies can cause problems on the floor. We have over 500 active recipes and SKUs! I'm looking for a free and open-source tool or platform that can convert these documents to Markdown so I can use a simple local language model to extract the relevant information. I have a lot of experience with ETL pipelines, but this case is tricky because the Word documents vary in format. Any suggestions?
5 Answers
Did you know that .docx files are essentially zip files? You can rename one to .zip and extract its contents; the main body text lives in word/document.xml. So, if you batch rename the files, you can explore the extracted directories and find what you need easily!
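You don't even need to rename anything if you script it. Here's a rough sketch using only the Python standard library that opens a .docx as a zip archive and pulls the paragraph text out of word/document.xml. The function name `docx_paragraphs` is just something I made up for illustration, and it ignores styles, headers, and footers, so treat it as a starting point rather than a full parser.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each non-empty paragraph in a .docx file.

    Hypothetical helper: accepts a file path or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Paragraphs inside table cells come back too, in document order, but without any row or column structure, so this is better for free text than for BOM tables.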
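If you're into PowerShell, check out the [ImportExcel](https://github.com/dfinke/ImportExcel) module. It reads and writes Excel workbooks without needing Excel installed, so if any of your BOM data lives in spreadsheets you can export it to CSV and hand that to an LLM for parsing. Keep in mind it works on Excel files, not Word documents.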
If you’re comfortable using Python, you could use a library like python-docx to read the paragraphs and tables out of each file. It takes some effort to set up, especially with documents that vary in format, but it's definitely doable!
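For the BOM tables specifically, something along these lines might work. It's a minimal sketch, assuming python-docx is installed and that the first row of each table is a header (which may not hold for all 500+ documents). `rows_to_markdown` and `docx_tables_to_markdown` are names I picked for the example, not part of the library.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of cell strings) as a Markdown table.

    Assumption: the first row is the header row.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Escape pipes, flatten newlines, and pad short rows so the table stays valid
    clean = [
        [cell.strip().replace("|", "\\|").replace("\n", " ") for cell in row]
        + [""] * (width - len(row))
        for row in rows
    ]
    lines = [
        "| " + " | ".join(clean[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in clean[1:]]
    return "\n".join(lines)


def docx_tables_to_markdown(path):
    """Return one Markdown table string per table in the .docx at `path`."""
    from docx import Document  # provided by the python-docx package

    document = Document(path)
    return [
        rows_to_markdown([[cell.text for cell in row.cells] for row in table.rows])
        for table in document.tables
    ]
```

One thing to watch for: merged cells tend to show up as repeated text across the columns they span, so you may want to de-duplicate before comparing against your system of record.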
You might want to check out a tool Microsoft created called [MarkItDown](https://github.com/microsoft/markitdown). It converts Word documents and other file types to Markdown, so it could be exactly what you need!
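If you go this route, a batch run over a folder could look roughly like this. It's a sketch, assuming the markitdown package is installed with Word support; the folder layout and the helper names `markdown_path` and `convert_folder` are hypothetical.

```python
from pathlib import Path


def markdown_path(docx_path, out_dir):
    """Map e.g. recipes/SKU-1.docx to <out_dir>/SKU-1.md (hypothetical layout)."""
    return Path(out_dir) / (Path(docx_path).stem + ".md")


def convert_folder(src_dir, out_dir):
    """Convert every .docx in src_dir to Markdown; return the files written."""
    from markitdown import MarkItDown  # third-party package, installed separately

    converter = MarkItDown()
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        result = converter.convert(str(docx_path))
        target = markdown_path(docx_path, out_dir)
        # text_content holds the Markdown rendering of the document
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

Keeping one .md file per source document makes it easy to trace any mismatch the LLM finds back to the original Word file.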
The Microsoft GitHub really does have some treasures in it!
Pandoc is a great option too! It’s like a universal document converter. Once you convert your documents to markdown, you can keep that as your official version and convert back to .docx when necessary. Super handy!
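If you want to drive Pandoc from a script, here's a small sketch that shells out to it in either direction. It assumes the `pandoc` executable is on your PATH and that GitHub-flavored Markdown (`gfm`) is the flavor you want; the function names are made up for the example.

```python
import subprocess
from pathlib import Path

# Pandoc format names keyed by file extension (gfm is an assumed choice)
FORMATS = {".docx": "docx", ".md": "gfm"}


def pandoc_command(src, dst):
    """Build the pandoc argument list for converting src to dst."""
    src, dst = Path(src), Path(dst)
    return [
        "pandoc",
        "--from", FORMATS[src.suffix.lower()],
        "--to", FORMATS[dst.suffix.lower()],
        str(src),
        "--output", str(dst),
    ]


def convert(src, dst):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(src, dst), check=True)
```

Calling `convert("SKU-1.docx", "SKU-1.md")` goes to Markdown, and swapping the arguments goes back to .docx. Expect the round trip to lose some Word-specific formatting.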

THANK YOU! THIS is exactly what I was looking for. The research has been a nightmare lately!