I'm looking to extract not just the text from a DOCX file but also a lot of detailed formatting information. I want to capture things like page margins, bold and underline styles, text alignment (left, right, center, justified), as well as newlines, spaces, tabs, bullet points, numbered lists, and even tables. I explored using `python-docx`, but it seems limited; it allows access to basic formatting like bold/underline and paragraph alignment, but I can't find a way to get deeper details such as ruler positions, custom tab stops, or bullet styles. Has anyone figured out how to tackle this? Are there any other Python libraries or methods beyond `python-docx` that can help me extract this level of detail? Any tips, code snippets, or resources would be super helpful!
1 Answer
You might want to unpack the DOCX file directly, since it's just a ZIP archive of XML files. The body text and its direct formatting live in `word/document.xml`, with style definitions in `word/styles.xml` and list (bullet/numbering) definitions in `word/numbering.xml`, so all the raw data is there. The amount of XML can be overwhelming, but with some parsing you can find what you need. A library like `lxml` makes that easier: it lets you query for just the elements that carry the formatting details you're after instead of reading the XML by hand.
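Here is a minimal sketch of that approach, not a complete extractor. It uses the standard library's `zipfile` and `xml.etree.ElementTree` so it runs without extra installs; `lxml.etree` offers a largely compatible API if you prefer it. The helper names (`load_document_xml`, `page_margins`, `paragraph_details`) are made up for illustration. It only reads formatting applied directly in `word/document.xml`, so anything inherited from a style in `word/styles.xml` will show up as missing.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(elem, name):
    """Return a w:-namespaced attribute, or None if element/attribute is absent."""
    return None if elem is None else elem.get(f"{{{W}}}{name}")


def load_document_xml(docx):
    """Parse word/document.xml from a .docx path or file-like object."""
    with zipfile.ZipFile(docx) as zf:
        return ET.fromstring(zf.read("word/document.xml"))


def page_margins(root):
    """Margins of the final section, in twips (1/1440 inch)."""
    pg_mar = root.find("w:body/w:sectPr/w:pgMar", NS)
    margins = {}
    for side in ("top", "right", "bottom", "left"):
        value = w_attr(pg_mar, side)
        if value is not None:
            margins[side] = int(value)
    return margins


def _toggle(rpr, tag):
    """True if a run property such as w:b or w:u is present and not switched off."""
    elem = rpr.find(f"w:{tag}", NS) if rpr is not None else None
    if elem is None:
        return False
    return w_attr(elem, "val") not in ("0", "false", "off", "none")


def _run_text(run):
    """Rebuild run text, mapping w:tab to a tab and w:br/w:cr to a newline."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}tab":
            parts.append("\t")
        elif child.tag in (f"{{{W}}}br", f"{{{W}}}cr"):
            parts.append("\n")
    return "".join(parts)


def paragraph_details(root):
    """Directly applied alignment, custom tab stops, and runs for each paragraph."""
    details = []
    for p in root.iter(f"{{{W}}}p"):
        ppr = p.find("w:pPr", NS)
        jc = ppr.find("w:jc", NS) if ppr is not None else None
        tabs = ppr.iterfind("w:tabs/w:tab", NS) if ppr is not None else []
        details.append({
            # w:jc values include left/start, center, right/end, both (justified)
            "alignment": w_attr(jc, "val"),
            # each stop is (kind, position in twips)
            "tab_stops": [(w_attr(t, "val"), int(w_attr(t, "pos"))) for t in tabs],
            "runs": [
                {
                    "text": _run_text(r),
                    "bold": _toggle(r.find("w:rPr", NS), "b"),
                    "underline": _toggle(r.find("w:rPr", NS), "u"),
                }
                for r in p.iterfind("w:r", NS)
            ],
        })
    return details
```

Typical use would be `root = load_document_xml("report.docx")` (the filename is a placeholder), then `page_margins(root)` and `paragraph_details(root)`. Dividing a twips value by 1440 gives inches.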
I tried unpacking it too, but I found the XML quite daunting. I'm hoping for something that’s more streamlined. If I have to, I might just learn how to handle the XML, but it's a lot to process.
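For what it's worth, the XML gets less daunting if you only look for a handful of elements. As a rough sketch (standard library only, with hypothetical helper names `list_formats`, `list_items`, and `tables`): a list paragraph carries a `w:numPr` with a `w:numId` and `w:ilvl`, and the actual bullet or number format for that pair is defined in `word/numbering.xml`. Tables are `w:tbl` elements made of `w:tr` rows and `w:tc` cells. This ignores per-list level overrides (`w:lvlOverride`) and flattens any nested tables into the cell text, so treat it as a starting point.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(elem):
    """Return the w:val attribute of an element, or None if it is missing."""
    return None if elem is None else elem.get(f"{{{W}}}val")


def _read_part(docx, name):
    """Parse one XML part of the archive, or return None if it does not exist."""
    with zipfile.ZipFile(docx) as zf:
        if name not in zf.namelist():
            return None
        return ET.fromstring(zf.read(name))


def list_formats(docx):
    """Map (numId, ilvl) -> (numFmt, lvlText) from word/numbering.xml."""
    root = _read_part(docx, "word/numbering.xml")
    if root is None:
        return {}
    # abstractNum holds the level definitions; num points at one abstractNum
    abstract = {}
    for an in root.iterfind("w:abstractNum", NS):
        an_id = an.get(f"{{{W}}}abstractNumId")
        for lvl in an.iterfind("w:lvl", NS):
            key = (an_id, lvl.get(f"{{{W}}}ilvl"))
            abstract[key] = (_val(lvl.find("w:numFmt", NS)),
                             _val(lvl.find("w:lvlText", NS)))
    formats = {}
    for num in root.iterfind("w:num", NS):
        num_id = num.get(f"{{{W}}}numId")
        an_id = _val(num.find("w:abstractNumId", NS))
        for (a_id, ilvl), fmt in abstract.items():
            if a_id == an_id:
                formats[(num_id, ilvl)] = fmt
    return formats


def list_items(docx):
    """Every paragraph that belongs to a list, with its level and format."""
    formats = list_formats(docx)
    root = _read_part(docx, "word/document.xml")
    items = []
    for p in root.iter(f"{{{W}}}p"):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue
        num_id = _val(num_pr.find("w:numId", NS))
        ilvl = _val(num_pr.find("w:ilvl", NS)) or "0"
        fmt, marker = formats.get((num_id, ilvl), (None, None))
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        # format is e.g. "bullet" or "decimal"; marker is the raw w:lvlText
        items.append({"text": text, "level": int(ilvl),
                      "format": fmt, "marker": marker})
    return items


def tables(docx):
    """Top-level tables as lists of rows, each row a list of cell texts."""
    root = _read_part(docx, "word/document.xml")
    result = []
    for tbl in root.iterfind("w:body/w:tbl", NS):
        rows = []
        for tr in tbl.iterfind("w:tr", NS):
            rows.append(["".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
                         for tc in tr.iterfind("w:tc", NS)])
        result.append(rows)
    return result
```

Both `list_items` and `tables` take a path or an open binary file, so the same calls work on an uploaded file held in memory.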