I'm interested in hearing how different teams monitor configuration drift across their environments, like production, staging, and testing. In the past, I've experienced issues where unnoticed config differences led to incidents. Changes happen, deployments go through, and sometimes it's not until something breaks that we notice the discrepancies. While some tools focus on file differences or configuration management, they often overlook the ongoing detection of drift. I've started to experiment with a tool that scans configurations from a Git repo, defines a baseline state, runs regular scans against that baseline, and alerts us to any drift. I'm currently testing this with configurations like .NET's appsettings.json and IIS's web.config. I'd love to know how other teams tackle this problem in real-life scenarios.
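For context, here is a minimal sketch of the baseline comparison step for a JSON file such as appsettings.json. The function names (`flatten`, `detect_drift`) and the sample settings are hypothetical, chosen only for illustration; this is not the actual tool. Nested keys are joined with `:`, the same delimiter .NET configuration uses for hierarchical keys.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested JSON into a flat dict keyed like 'Logging:LogLevel:Default'."""
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list):
        items = list(enumerate(node))
    else:
        return {prefix: node}
    if not items:
        # Keep empty objects/arrays visible instead of silently dropping them.
        return {prefix: node} if prefix else {}
    flat = {}
    for key, value in items:
        path = f"{prefix}:{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def detect_drift(baseline_text, current_text):
    """Return a list of (kind, key, baseline_value, current_value) tuples."""
    baseline = flatten(json.loads(baseline_text))
    current = flatten(json.loads(current_text))
    drift = []
    for key in sorted(baseline.keys() | current.keys()):
        if key not in current:
            drift.append(("removed", key, baseline[key], None))
        elif key not in baseline:
            drift.append(("added", key, None, current[key]))
        elif baseline[key] != current[key]:
            drift.append(("changed", key, baseline[key], current[key]))
    return drift


if __name__ == "__main__":
    # Hypothetical sample settings, not taken from a real system.
    baseline = '{"Logging": {"LogLevel": {"Default": "Information"}}}'
    current = '{"Logging": {"LogLevel": {"Default": "Debug"}}, "Timeout": 30}'
    for finding in detect_drift(baseline, current):
        print(finding)
```

Comparing flattened keys rather than raw text means reordered properties or whitespace changes do not count as drift, which seemed like the right trade-off for settings files. An XML file like web.config would need its own parser.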
5 Answers
Most teams I know focus on preventing drift rather than merely detecting it: they use Infrastructure as Code (IaC) approaches and pull configurations directly from Git during deployments. However, the scheduled checks you’re implementing are really useful, especially for catching those manual changes that sneak through.
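A rough sketch of what such a scheduled check could look like at the file level, assuming the Git copy of the configs and the deployed copy are both reachable as directories. The function names and directory layout are hypothetical, and the scheduling itself (cron, Task Scheduler, a CI job) is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_manual_changes(repo_dir, deployed_dir):
    """Map each repo file (relative path) to 'missing' or 'modified' if the deployed copy differs."""
    repo_dir, deployed_dir = Path(repo_dir), Path(deployed_dir)
    findings = {}
    for repo_file in sorted(p for p in repo_dir.rglob("*") if p.is_file()):
        rel = repo_file.relative_to(repo_dir)
        if ".git" in rel.parts:
            continue  # skip Git metadata if repo_dir is a full checkout
        deployed_file = deployed_dir / rel
        if not deployed_file.is_file():
            findings[rel.as_posix()] = "missing"
        elif sha256_of(repo_file) != sha256_of(deployed_file):
            findings[rel.as_posix()] = "modified"
    return findings
```

An empty result means the deployed files still match what is in Git. Anything else is a candidate manual change to investigate. Note that this only checks files tracked in the repo, so extra files dropped onto the server would go unnoticed.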
We follow a GitOps model, ensuring that no one can just tweak things on a whim. Using tools like Terraform and Ansible helps us maintain consistency, and everything has to be reflected in the code base to keep things tidy. Scheduled checks against known baselines are handy for catching any manual changes that slip through.
I’ve personally used a tool called StratoLens for tracking state and config drift, particularly with Azure setups. It’s pretty effective for infrastructure but doesn’t quite dig deep into AKS or VM OS levels, which is a bit limiting.
We rely heavily on automation tools. When we detect drift, we either roll back the unsanctioned changes or, especially where we use golden images, replace the entire system.
We take a strict approach: nobody gets direct access to servers. We provision from standard server images from the start, and any change goes through proper automation tools like Ansible, with tight limits on human access. If drift is detected, it gets reported, but ultimately we just avoid human interaction with the servers entirely!
I hear you, but if someone logs in, it's back to square one! Regenerating and resetting the servers after a while helps ensure no drift lingers.

Are there tools that address that gap? I'd love to know!