I'm really curious about how teams manage drift between what's in Git and what's actually running in their clusters. I'm not referring to the obvious sync failures, but more to the gradual changes like manual kubectl fixes, urgent hotfixes during incidents, changes made by operators, or even upgrades that slightly alter the state. What are your strategies for catching drift early instead of discovering it weeks later? Do you set up alerts for it, run diffs, or simply rely on re-syncs? And once you identify drift, how do you handle it in practice? Do you use auto-reverts, pull requests, or some manual cleanup? It seems like everyone is on the GitOps bandwagon, yet the "day 2" issues with drift are still quite messy. I'd love to hear about real-world setups rather than just theoretical answers!
10 Answers
In a small team, we use a full CI/CD + IaC setup. Sure, drift happens, but we each take accountability for our changes. Mistakes tend to be minor, and we focus on improving the process rather than obsessing over perfect practice. On larger teams, especially with multiple developers active in the clusters, detecting drift quickly becomes crucial.
We handle it with hotfixes through GitOps, but I think it really depends on team discipline. I've watched systems fall apart under pressure, with backports either landing late or not at all. I'm curious how you maintain that discipline when everything's on fire!
Yeah, that method can easily fall apart in a crisis. What strategies do you have in place to enforce compliance during those times?
My approach is to reject drift outright. We reapply from Git automatically every day, and if the re-apply is about to overwrite a live change, the relevant support team gets notified first. It's a tough battle out there!
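The nightly re-apply itself is nothing fancy; a minimal sketch of it as a CronJob (the image, repo URL, and ServiceAccount are placeholders, and the notification step is omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-reapply
  namespace: gitops
spec:
  schedule: "0 3 * * *"               # once a night, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: reapplier   # needs apply rights on the target namespaces
          restartPolicy: Never
          containers:
            - name: reapply
              # hypothetical image that ships both git and kubectl
              image: registry.example.com/git-kubectl:latest
              command:
                - /bin/sh
                - -ec
                - |
                  git clone --depth 1 https://github.com/example/manifests /tmp/repo
                  kubectl apply -k /tmp/repo/clusters/prod
```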
We rely on ArgoCD's auto-sync feature, along with monitoring its drift metrics, which we audit on a weekly basis. If any resource hasn't been in sync with Git for longer than a day, we create a ticket. We tend to avoid manual cleanup unless the drift is unusual; otherwise, we just trigger a sync and see how it goes.
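For the "longer than a day" rule, a minimal sketch of what that can look like as a Prometheus alert, assuming Prometheus scrapes the ArgoCD application controller metrics (the severity label and whatever routes it into a ticket are up to you):

```yaml
groups:
  - name: gitops-drift
    rules:
      - alert: ArgoAppOutOfSyncTooLong
        # argocd_app_info carries a sync_status label per Application;
        # "for: 24h" means the alert only fires on drift that persists a full day
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 24h
        labels:
          severity: ticket
        annotations:
          summary: "Application {{ $labels.name }} has been out of sync with Git for over 24h"
```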
That weekly audit sounds like a smart approach. Tracking drift age rather than reacting to every diff seems like a great way to reduce noise. Have you ever had auto-sync silently revert a manual fix, hiding an issue you only noticed later?
Could you explain how your automated ticketing system works for prolonged drifts? Sounds intriguing!
We anticipate that drift will occur, so we focus on identifying it early. Manual changes via kubectl are permitted during incidents, but anything that doesn't get backported to Git quickly is treated as a failure. That one rule eliminates most long-term drift.

We continuously diff live state against Git, not just after a failed sync, and we only alert on diffs that persist. We also ignore expected changes from known mutators (HPAs, admission webhooks, and the like) to avoid excess noise; see the sketch below.

Remediation is straightforward: small diffs auto-revert, while larger ones generate a PR with context attached. We intentionally keep manual cleanup as the rare exception. Overall, drift is primarily a process problem rather than a tooling one.
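If you're on ArgoCD, the usual mechanism for muting those expected mutations is ignoreDifferences; a minimal sketch (the app and repo names are made up):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                    # made-up application
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # made-up repo
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # an HPA owns this field, so diffs here are expected
```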
This is honestly one of the best summaries I've come across! I especially love your point about tracking drift *age* instead of just its presence. It seems like many teams struggle and end up overwhelmed with diffs.
Isn't drift a non-issue if you enable auto-sync? Manual changes just get overridden. And if auto-sync has been turned off, there should be a clear audit trail of that, so you can just turn it back on.
Ideally, people shouldn't be changing live cluster resources at all. If that's common, there's probably a larger underlying issue. But on that note, we do use auto-sync and review audit logs regularly.
That’s not always the case. I’ve seen many scenarios where people can change live clusters without any audit logs, making it tough to track down issues.
We use auto-sync and alert on metrics when things stay out of sync for a prolonged period.
Straightforward but effective! Do you adjust the time frame for this "extended period" per resource type, or do you stick to a global threshold? Some drift is definitely more critical than others.
With ArgoCD, auto sync is non-negotiable. If you make changes during an incident, they'll keep getting reverted until you learn your lesson.
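For reference, that behavior is just selfHeal in the automated sync policy; a minimal Application sketch (names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app                    # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests   # placeholder
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      selfHeal: true   # revert any live change that diverges from Git
      prune: true      # also delete resources that were removed from Git
```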
But that does depend on your process and how much control you have over ArgoCD. Disabling self-heal is quite easy if someone has the permissions.
Implementing proper RBAC can eliminate many of these problems.
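In ArgoCD, for example, it can be as simple as not granting `applications, update`: people can still sync, but they can't edit the Application spec to switch self-heal off. A rough sketch of the RBAC ConfigMap (the group name is made up):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # devs may view and sync, but not update the Application spec,
    # so they cannot disable automated sync or self-heal
    p, role:dev, applications, get, */*, allow
    p, role:dev, applications, sync, */*, allow
    # made-up SSO group mapping
    g, my-org:developers, role:dev
```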
Our setup is mostly Flux-managed resources, so it's almost impossible for drift to stick around. If anything Flux manages gets changed out of band, it just gets reverted on the next reconcile. Works like a charm!
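For anyone curious, the relevant knobs are just prune and the reconcile interval on the Kustomization; a minimal sketch using the standard flux-system bootstrap names:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m        # re-reconcile (and revert drift) at least this often
  prune: true          # remove live objects that were deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system  # the default bootstrap GitRepository
  path: ./apps
```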
Everything needs to be in GitOps or it doesn't exist—simple as that. Drift problem solved!

Absolutely! As you scale up your team and resources, it becomes even more important to keep an eye on these issues.