I've been working on a project that involves a lot of technical jargon and acronyms, and I want to keep a page updated with all the acronyms and specialized terms we use. I think it would be great to develop a tool that scans through source files and documentation to generate a list of acronyms automatically. My idea is to focus on alphanumeric sequences that contain at least one capital letter (ignoring the first character) and to exclude code, only looking at comments and docstrings. I also want to include features like user-defined regex patterns and a list of acronyms to ignore. Is there a tool that already does this? If not, does my idea make sense?
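To make the rule concrete, here is a rough sketch of the matching step only, assuming "alphanumeric sequence with at least one capital letter after the first character" is read literally. The names `find_acronyms`, `extra_patterns`, and `ignore` are made up for illustration and do not come from any existing tool.

```python
import re

# Assumed reading of the rule: a run of letters/digits that contains
# at least one capital letter somewhere after its first character.
ACRONYM_RE = re.compile(r"\b[A-Za-z0-9][A-Za-z0-9]*[A-Z][A-Za-z0-9]*\b")


def find_acronyms(text, extra_patterns=(), ignore=()):
    """Return the sorted, de-duplicated candidate acronyms found in text.

    extra_patterns: user-defined regex strings applied in addition to the default.
    ignore: exact tokens to leave out of the result.
    """
    patterns = [ACRONYM_RE] + [re.compile(p) for p in extra_patterns]
    ignored = set(ignore)
    found = set()
    for pattern in patterns:
        for match in pattern.finditer(text):
            token = match.group(0)
            if token not in ignored:
                found.add(token)
    return sorted(found)
```

Note that this rule also matches identifiers such as `camelCase`, so it will produce false positives unless the input is restricted to comments and docstrings or the ignore list is used.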
6 Answers
For a basic version, consider using an AST parser that can grab comments and docstrings. Libraries like libcst might be useful here! Or, you could take all the text and run it through a language model. It’s a fuzzy problem, and LLMs are pretty good at figuring that stuff out.
Identifying acronyms can be tricky. You might want to define exactly what counts as an acronym for your tool, like requiring a certain letter casing. But it sounds like you're okay with some false positives since you want to catch as many as possible. Just keep that in mind while developing your package!
I don’t quite understand what you’re proposing, but it sounds interesting!
How can it sound interesting if you don’t get it?
I actually built something similar with VBA to search Word documents using regex patterns. It was straightforward: just a few lines to catch acronyms and track whether they’d been defined before. You might want to look into regex for a quick solution; it can work magic!
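The original was VBA, which isn't shown here, so the following is only a Python sketch of the same idea. It assumes an acronym is an all-caps token of two or more characters and that "defined" means it first appeared in parentheses after its long form, as in "Rules of Engagement (ROE)". Both assumptions, and the name `undefined_acronyms`, are illustrative.

```python
import re

# Assumption: an acronym is an all-caps token of length >= 2 (digits allowed
# after the first letter). This is stricter than the rule in the question.
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def undefined_acronyms(text):
    """Return acronyms that are used before any parenthesised definition.

    An occurrence wrapped directly in parentheses, e.g. "(ROE)", is treated
    as the definition; earlier bare uses are flagged, in order of appearance.
    """
    defined = set()
    flagged = []
    for match in ACRONYM.finditer(text):
        token = match.group(0)
        start, end = match.span()
        in_parens = text[start - 1:start] == "(" and text[end:end + 1] == ")"
        if in_parens:
            defined.add(token)
        elif token not in defined and token not in flagged:
            flagged.append(token)
    return flagged
```

For example, in "The Rules of Engagement (ROE) apply. Check the ROE and the SOP." only `SOP` is flagged, because `ROE` was introduced in parentheses before its first bare use.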
I’d like to avoid false positives from the actual code, though.
Atlassian has features in Confluence (or maybe Jira) that guess acronyms based on their context across pages. Might be worth checking out for inspiration!
This tool would have been super helpful back when I worked in defense contracting. The acronyms can get overwhelming!
Yeah, I think it’s more important to have a wider identification range and deal with any mistakes later.