I'm curious whether anyone here is extracting financial data from 10-K and 10-Q reports. Specifically, I'm looking for ways to pull data from the income statement (like revenue and net income), balance sheet (assets and liabilities), and cash flow statement (cash flow from operating, investing, and financing activities). Are you using any particular methods, such as parsing iXBRL tags or leveraging large language models (LLMs)? I'd love to hear about the approaches you're using and their respective pros and cons!
5 Answers
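iXBRL is machine-readable, so it’s definitely worth considering for accuracy: the figures are tagged by the filer, so you read the reported values directly instead of inferring them from the page layout. It can also streamline extraction significantly compared to manual methods.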
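For anyone who wants to see what tag parsing looks like, here's a minimal sketch using only the Python standard library. The sample document, the `FY2023` context ID, and the `extract_facts` helper are made up for illustration. Real filings are messier (number `format` attributes, HTML entities, many contexts per concept), so treat this as a starting point rather than a complete parser.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Inline XBRL 1.1 namespace, which numeric facts (ix:nonFraction) live in.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Tiny made-up document standing in for a real filing.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Revenue: <ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
        unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></p>
    <p>Net loss: (<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023"
        unitRef="USD" scale="6" decimals="-6" sign="-">56</ix:nonFraction>)</p>
  </body>
</html>"""


def extract_facts(xhtml: str) -> list[dict]:
    """Return one dict per numeric fact found in a well-formed iXBRL string."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter(f"{{{IX_NS}}}nonFraction"):
        # Assumes comma thousands separators; real filings declare this via `format`.
        text = "".join(el.itertext()).strip().replace(",", "")
        if not text:
            continue
        # `scale` is a power of ten applied to the displayed number.
        value = Decimal(text) * (Decimal(10) ** int(el.get("scale", "0")))
        # `sign="-"` marks a negative fact (the page usually shows parentheses).
        if el.get("sign") == "-":
            value = -value
        facts.append({
            "name": el.get("name"),
            "context": el.get("contextRef"),
            "unit": el.get("unitRef"),
            "value": value,
        })
    return facts


for fact in extract_facts(SAMPLE):
    print(fact["name"], fact["context"], fact["value"])
```

The `contextRef` is what tells you which period a number belongs to, so in practice you'd also need to read the context definitions to separate current-year figures from prior-year comparatives.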
Check out the [edgartools library](https://github.com/dgunning/edgartools). It's a straightforward tool for working with EDGAR data and might be useful for your extraction needs.
That's an interesting topic! Maybe look into some YouTube tutorials; they might have practical insights.
Extracting clean data from 10-Ks and 10-Qs can be challenging. I use an LLM to process the PDFs of companies in my portfolio. It helps me pull out key figures like revenue and cash flows quickly, but I do have to be careful since it can sometimes misinterpret the data. Are you looking to analyze a personal portfolio, scout for new investments, or do broader market research? That could totally affect which method is more suitable for you.
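One cheap way to catch misreads is to test extracted figures against the accounting identity that total assets equal total liabilities plus total equity. This is only a sketch: the function name and the 0.5% tolerance are my own assumptions, and some balance sheets carry items such as mezzanine equity outside those two totals, so a failure means "look closer" rather than "definitely wrong".

```python
def balance_sheet_is_consistent(
    total_assets: float,
    total_liabilities: float,
    total_equity: float,
    rel_tol: float = 0.005,  # assumed tolerance for rounding in the filing
) -> bool:
    """Check assets == liabilities + equity within a relative tolerance."""
    expected_assets = total_liabilities + total_equity
    return abs(total_assets - expected_assets) <= rel_tol * abs(total_assets)


# Figures that tie out pass the check.
print(balance_sheet_is_consistent(1000.0, 600.0, 400.0))
# A dropped digit in equity (40 instead of 400) is flagged.
print(balance_sheet_is_consistent(1000.0, 600.0, 40.0))
```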
I'm envisioning a tool where users can upload reports and receive structured data back. A hybrid model sounds great: use iXBRL parsing where the tags exist, since it's efficient, and keep an LLM as a backup for PDFs!
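A rough sketch of how that routing could work, assuming the upload arrives as raw bytes. The `choose_extractor` and `extract` names are hypothetical, and the two handlers are passed in as plain callables so that any iXBRL parser or LLM client can be plugged in.

```python
from typing import Callable

# Namespace URI that inline XBRL 1.1 documents declare.
IX_NAMESPACE = b"http://www.xbrl.org/2013/inlineXBRL"


def choose_extractor(data: bytes) -> str:
    """Pick a route from the file contents: 'ixbrl' if tagged, else 'llm'."""
    if data.lstrip().startswith(b"%PDF"):
        return "llm"  # PDFs carry no iXBRL tags
    if IX_NAMESPACE in data:
        return "ixbrl"
    return "llm"  # untagged HTML or anything else falls back to the LLM


def extract(
    data: bytes,
    parse_ixbrl: Callable[[bytes], dict],
    ask_llm: Callable[[bytes], dict],
) -> dict:
    """Run the chosen handler and record which route produced the facts."""
    route = choose_extractor(data)
    handler = parse_ixbrl if route == "ixbrl" else ask_llm
    return {"source": route, "facts": handler(data)}


# Stub handlers standing in for a real parser and a real LLM call.
result = extract(
    b"%PDF-1.7 ...",
    parse_ixbrl=lambda d: {"Revenues": 1},
    ask_llm=lambda d: {"Revenues": 2},
)
print(result)
```

Recording the `source` alongside the facts makes it easy to apply stricter review to anything that came from the LLM path.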
Sounds like a solid plan! Combining approaches could really enhance accuracy.
I’ve been using the SimFin API for a few years now, and it’s fantastic for bulk downloading stock data. It allows me to build custom metrics and analyze industry averages based on my own criteria. Your method really depends on your objectives, though!

Got any specific links? I'd love to check them out!