I'm trying to figure out a way to automatically determine whether a codebase is considered "legacy" or "modern." We had to evaluate a tech stack recently and needed to see how outdated it was and how much work migrating it would require. Manually examining the repository took ages, so I decided to automate part of the process.
I came up with a simple heuristic that scans source files for specific keywords to count occurrences of "classic" versus "modern" indicators. Here's the setup:
- Classic indicators include terms like 'angularjs', 'jquery', and 'python2'.
- Modern indicators include keywords like 'react18', 'vue3', 'python3.12', and 'typescript5'.
I tally how many classic and modern keywords appear and label the codebase with whichever count is higher. I've tested this on a few projects and run into flaws, such as a modern Flask app being misclassified as classic because it uses a classic framework.
Some known issues include:
- My keyword approach lacks sophistication: raw occurrence counts don't give a real picture of the codebase.
- Mixed codebases can skew results, causing a modern project to appear classic simply by having legacy code intermixed.
- I only analyze the first 10KB of each file, which might miss critical modern imports later in larger files.
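For reference, here is a minimal sketch of the heuristic described above, not the exact script. The function names, the case-insensitive matching, and the "undetermined" result on a tie are assumptions made for illustration; the keyword lists and the 10KB limit come from the description.

```python
from collections import Counter
from pathlib import Path

# Keyword lists as given in the question; extend as needed.
CLASSIC = ("angularjs", "jquery", "python2")
MODERN = ("react18", "vue3", "python3.12", "typescript5")
MAX_BYTES = 10 * 1024  # only the first 10KB of each file is scanned


def count_indicators(text: str) -> Counter:
    """Count occurrences of classic and modern keywords (case-insensitive: an assumption)."""
    lowered = text.lower()
    counts = Counter(classic=0, modern=0)
    for word in CLASSIC:
        counts["classic"] += lowered.count(word)
    for word in MODERN:
        counts["modern"] += lowered.count(word)
    return counts


def classify_repo(root: str) -> str:
    """Return 'classic', 'modern', or 'undetermined' (tie handling is an assumption)."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                head = fh.read(MAX_BYTES)  # anything past 10KB is never seen
            totals += count_indicators(head.decode("utf-8", errors="ignore"))
    if totals["classic"] == totals["modern"]:
        return "undetermined"
    return "classic" if totals["classic"] > totals["modern"] else "modern"
```

Written this way, the weaknesses listed above are easy to see: a single legacy file with many 'jquery' mentions can outweigh everything else, and an import that sits beyond the first 10KB is invisible.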
I'm curious if there are better methods for determining a codebase's age than my current keyword counting. I've been thinking about directly checking version numbers in package.json or requirements.txt instead, since they provide concrete data.
5 Answers
It sounds like you're on the right track with your idea about checking package.json and requirements.txt. The version numbers there are much more reliable than trying to guess based on keywords. For instance, knowing you're dealing with React 19 versus React 16 tells you a lot more.
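As a rough sketch of what that could look like for package.json (the helper names `major_version` and `dependency_majors` are made up for this example, and the range parsing is deliberately simplistic):

```python
import json
import re
from typing import Dict, Optional


def major_version(spec: str) -> Optional[int]:
    """Extract the leading major version from a range such as '^16.8.0' or '~5.4'.

    Returns None for specs that do not start with a version number
    (for example 'latest', '*', or a git URL).
    """
    match = re.match(r"[\^~>=<v\s]*(\d+)", spec)
    return int(match.group(1)) if match else None


def dependency_majors(package_json_text: str) -> Dict[str, Optional[int]]:
    """Map each declared dependency to its major version (None if unparseable)."""
    manifest = json.loads(package_json_text)
    declared = {
        **manifest.get("dependencies", {}),
        **manifest.get("devDependencies", {}),
    }
    return {name: major_version(spec) for name, spec in declared.items()}
```

For a compound range like '>=1.2.0 <3' this only reports the lower bound, so treat the result as a hint. Deciding which major versions count as "old" for each package is still up to you.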
Also, consider checking how recently your lockfiles were last updated. If package-lock.json hasn't been touched in a few years, that's a strong indicator of a legacy codebase.
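One way to check this, assuming the project is a git repository and `git` is on the PATH, is to ask git for the last commit that touched the lockfile. The function names here are illustrative, and any threshold you apply to the result is your own call.

```python
import subprocess
from datetime import datetime, timezone


def last_commit_time(path: str, repo_dir: str = ".") -> datetime:
    """Return the time of the most recent commit that touched `path`."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    if not out:
        raise ValueError(f"no commits found for {path}")
    # %ct is the committer date as a Unix timestamp
    return datetime.fromtimestamp(int(out), tz=timezone.utc)


def age_in_days(then: datetime, now: datetime) -> float:
    """Elapsed time between two datetimes, in days."""
    return (now - then).total_seconds() / 86400
```

Usage would be something like `age_in_days(last_commit_time("package-lock.json"), datetime.now(timezone.utc))`. The commit date is usually more trustworthy than the file's modification time, which resets on every fresh clone.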
To me, a codebase is legacy if it has no tests in place. I mean, we can have the latest versions, but without testing, it’s hard to be sure that changes are safe. A codebase with solid test coverage is far more reliable, no matter the tech stack it’s built on.
Forget the keywords; just check the dependencies. If you see only classic libraries, it’s likely legacy, and if it's all modern, then it’s not. If there’s a mix, you either have a messy codebase or an upgrade that never fully happened.
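A minimal sketch of that three-way split, assuming you already have the list of dependency names. The two sets below are placeholders borrowed loosely from the question, not an authoritative catalogue, and the "unknown" bucket is an addition for projects that match neither set.

```python
from typing import Iterable

# Placeholder lists for illustration only; fill in what matters for your stack.
CLASSIC_DEPS = frozenset({"angularjs", "jquery"})
MODERN_DEPS = frozenset({"react", "vue", "typescript"})


def classify_dependencies(
    deps: Iterable[str],
    classic: frozenset = CLASSIC_DEPS,
    modern: frozenset = MODERN_DEPS,
) -> str:
    """Return 'legacy', 'modern', 'mixed', or 'unknown' for a set of dependency names."""
    names = {d.lower() for d in deps}
    has_classic = bool(names & classic)
    has_modern = bool(names & modern)
    if has_classic and has_modern:
        return "mixed"  # messy codebase or an upgrade that never finished
    if has_classic:
        return "legacy"
    if has_modern:
        return "modern"
    return "unknown"
```

Unlike keyword counting, this looks at presence rather than frequency, so one file that mentions a library fifty times doesn't tip the result.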
Why not just use AI for this assessment? Formalizing the detection is tricky, and AI could handle the nuances better than any rule you can define.
If a project only uses ESM, that's a good sign it’s modern. But don't stop there. Look at the dependency graph: legacy projects often have complex interdependencies that can point to deeper issues. Simple keyword counts miss these nuances, like a React app with poor structural integrity due to circular dependencies and dead modules everywhere.
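To make the circular-dependency check concrete, here is a small sketch that finds one cycle in a module graph using depth-first search. It assumes you have already extracted the graph as a dict mapping each module to the modules it imports; building that dict from real source files is out of scope here, and `find_cycle` is just an illustrative name.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of module names, or None if the graph is acyclic."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    stack: List[str] = []  # modules on the current DFS path

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        stack.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have closed a loop
                return stack[stack.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` returns `["a", "b", "c", "a"]`. The recursion is fine for small graphs; a very deep import chain would need an iterative version.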

Exactly! Just because it uses a modern framework doesn't mean it’s well-architected. Diving into the structure can reveal way more than you’d guess from just a few keywords.