I'm really curious about how teams manage drift between what's in Git and what's actually running in their clusters. I'm not referring to the obvious sync failures, but more to the gradual changes like manual kubectl fixes, urgent hotfixes during incidents, changes made by operators, or even upgrades that slightly alter the state. What are your strategies for catching drift early instead of discovering it weeks later? Do you set up alerts for it, run diffs, or simply rely on re-syncs? And once you identify drift, how do you handle it in practice? Do you use auto-reverts, pull requests, or some manual cleanup? It seems like everyone is on the GitOps bandwagon, yet the "day 2" issues with drift are still quite messy. I'd love to hear about real-world setups rather than just theoretical answers!
10 Answers
In a small team, we use a full CI/CD + IaC setup. Sure, drift happens, but we each take accountability for our changes. Mistakes tend to be minor, and we focus on improving the process rather than obsessing over perfect practice. On larger teams, especially with multiple developers active in the clusters, detecting drift quickly becomes crucial.
We handle it with hotfixes through GitOps, but I think it really depends on team discipline. I've watched systems fall apart under pressure, with backports either landing late or not at all. I'm curious how you maintain that discipline when everything's on fire!
Yeah, that method can easily fall apart in a crisis. What strategies do you have in place to enforce compliance during those times?
My approach is to reject drift outright. We reapply from Git automatically every day, and if the re-apply is about to overwrite a live change, the relevant support team gets notified first. It's a tough battle out there!
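The nightly re-apply itself is nothing fancy; a minimal sketch of it as a CronJob (the image, repo URL, and ServiceAccount are placeholders, and the notification step is omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-reapply
  namespace: gitops
spec:
  schedule: "0 3 * * *"               # once a night, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: reapplier   # needs apply rights on the target namespaces
          restartPolicy: Never
          containers:
            - name: reapply
              # hypothetical image that ships both git and kubectl
              image: registry.example.com/git-kubectl:latest
              command:
                - /bin/sh
                - -ec
                - |
                  git clone --depth 1 https://github.com/example/manifests /tmp/repo
                  kubectl apply -k /tmp/repo/clusters/prod
```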
We rely on ArgoCD's auto-sync feature, along with monitoring its drift metrics, which we audit on a weekly basis. If any resource hasn't been in sync with Git for longer than a day, we create a ticket. We tend to avoid manual cleanup unless the drift is unusual; otherwise, we just trigger a sync and see how it goes.
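For the "longer than a day" rule, a minimal sketch of what that can look like as a Prometheus alert, assuming Prometheus scrapes the ArgoCD application controller metrics (the severity label and whatever routes it into a ticket are up to you):

```yaml
groups:
  - name: gitops-drift
    rules:
      - alert: ArgoAppOutOfSyncTooLong
        # argocd_app_info carries a sync_status label per Application;
        # "for: 24h" means the alert only fires on drift that persists a full day
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 24h
        labels:
          severity: ticket
        annotations:
          summary: "Application {{ $labels.name }} has been out of sync with Git for over 24h"
```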
That weekly audit sounds like a smart approach. Tracking drift age rather than reacting to every diff seems like a great way to reduce noise. Have you ever had auto-sync silently revert a manual fix, hiding an issue you only noticed later?
Could you explain how your automated ticketing system works for prolonged drifts? Sounds intriguing!
We anticipate that drift will occur, so we focus on identifying it early. Manual changes via kubectl are permitted during incidents, but anything that doesn't get backported to Git quickly is treated as a failure. That one rule eliminates most long-term drift.

We continuously diff live state against Git, not just after a failed sync, and we only alert on diffs that persist. We also ignore expected changes from known mutators (HPAs, admission webhooks, and the like) to avoid excess noise; see the sketch below.

Remediation is straightforward: small diffs auto-revert, while larger ones generate a PR with context attached. We intentionally keep manual cleanup as the rare exception. Overall, drift is primarily a process problem rather than a tooling one.
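If you're on ArgoCD, the usual mechanism for muting those expected mutations is ignoreDifferences; a minimal sketch (the app and repo names are made up):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                    # made-up application
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # made-up repo
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # an HPA owns this field, so diffs here are expected
```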
This is honestly one of the best summaries I've come across! I especially love your point about tracking drift *age* instead of just its presence. It seems like many teams struggle and end up overwhelmed with diffs.
Isn't drift a non-issue if you enable auto-sync? Manual changes just get overridden. And if auto-sync has been turned off, there should be a clear audit trail of that, so you can just turn it back on.
Ideally, people shouldn't be changing live cluster resources at all. If that's common, there's probably a larger underlying issue. But on that note, we do use auto-sync and review audit logs regularly.
That’s not always the case. I’ve seen many scenarios where people can change live clusters without any audit logs, making it tough to track down issues.
We use auto-sync and alert on metrics when things stay out of sync for a prolonged period.
Straightforward but effective! Do you adjust the time frame for this "extended period" per resource type, or do you stick to a global threshold? Some drift is definitely more critical than others.
With ArgoCD, auto sync is non-negotiable. If you make changes during an incident, they'll keep getting reverted until you learn your lesson.
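For reference, that behavior is just selfHeal in the automated sync policy; a minimal Application sketch (names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                    # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # placeholder
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      selfHeal: true   # revert any live change that diverges from Git
      prune: true      # also delete resources that were removed from Git
```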
But that does depend on your process and how much control you have over ArgoCD. Disabling self-heal is quite easy if someone has the permissions.
Implementing proper RBAC can eliminate many of these problems.
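In ArgoCD, for example, it can be as simple as not granting `applications, update`: people can still sync, but they can't edit the Application spec to switch self-heal off. A rough sketch of the RBAC ConfigMap (the group name is made up):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # devs may view and sync, but not update the Application spec,
    # so they cannot disable automated sync or self-heal
    p, role:dev, applications, get, */*, allow
    p, role:dev, applications, sync, */*, allow
    # made-up SSO group mapping
    g, my-org:developers, role:dev
```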
Our setup is mostly Flux-managed resources, so it's almost impossible for drift to stick around. If anything Flux manages gets changed out of band, it just gets reverted on the next reconcile. Works like a charm!
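For anyone curious, the relevant knobs are just prune and the reconcile interval on the Kustomization; a minimal sketch using the standard flux-system bootstrap names:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m        # re-reconcile (and revert drift) at least this often
  prune: true          # remove live objects that were deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system  # the default bootstrap GitRepository
  path: ./apps
```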
Everything needs to be in GitOps or it doesn't exist—simple as that. Drift problem solved!

Absolutely! As you scale up your team and resources, it becomes even more important to keep an eye on these issues.