Hey everyone! I'm part of a team managing a growing number of Kubernetes clusters (dozens, to be exact), and we've hit a significant challenge with maintenance work, particularly upgrades. It feels like an endless loop: just as we finish rolling out one version upgrade across all clusters, we're already gearing up for the next. The K8s N-2 support window is great for security, but keeping pace with it across this many clusters is relentless.
Upgrades don't just involve the Kubernetes control plane; they usually mean updating the CNI plugin, CSI drivers, ingress controllers, and more. We also run a lot of operators and controllers, like Prometheus and cert-manager, each with its own release cycle and potential breaking changes.
We operate in a hybrid environment with both managed cloud clusters and bare-metal clusters. I'm really eager to hear how other teams manage their upgrades and maintenance across many clusters. Here are some specific things I'm curious about:
1. Do you use any orchestration or automation tools for the upgrade process?
2. What criteria do you use to decide when to upgrade, and how long does it take to roll out upgrades?
3. What do your pre-flight and post-upgrade checks look like? Any tools you'd recommend?
4. How do you handle the lifecycle of all your add-ons? This has been a real headache for us.
5. How many people on your team focus on this? Is it a single person, a team, or do you rotate responsibilities?
Thanks in advance for any insights or experiences you can share!
5 Answers
Welcome to the world of SRE! For managing upgrades, we use a mix of CI/CD tooling and custom orchestration controllers. Our upgrade cadence varies by subsystem; we aim to upgrade every few months, and a complete rollout usually takes no more than two weeks. Before and after each upgrade we check that no alerts are firing, which has worked surprisingly well as a gate.
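The alert gate is simple enough to sketch. This isn't our exact tooling, just a minimal version assuming Alertmanager's v2 API; the service URL and the severity filter are placeholders:

```python
# Minimal pre-flight/post-upgrade gate: block while serious alerts are firing.
# Assumes Alertmanager's v2 HTTP API is reachable; the URL is a placeholder.
import sys
import requests

ALERTMANAGER = "http://alertmanager.monitoring.svc:9093"  # placeholder URL

def firing_alerts():
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts", params={"active": "true"}, timeout=10
    )
    resp.raise_for_status()
    # Ignore purely informational alerts; only warning/critical should block.
    return [
        a for a in resp.json()
        if a["labels"].get("severity") in ("warning", "critical")
    ]

alerts = firing_alerts()
for a in alerts:
    print(f"FIRING: {a['labels'].get('alertname')}")
if alerts:
    sys.exit(1)  # non-zero exit fails the pipeline stage
print("No blocking alerts; safe to proceed.")
```

Run it once before kicking off a rollout and again after, and let the non-zero exit fail the pipeline.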
Currently, I’m on a contract with a bank using Ansible for automation, specifically for cluster upgrades with a set of detailed playbooks. However, their heavy change control consumes a lot of my time just planning and executing upgrades, and they don’t want to improve the process. It’s draining!
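For anyone curious, the heart of those playbooks is a serial drain/upgrade/uncordon loop per node. Here's that flow sketched in Python rather than Ansible, just to keep the thread's examples in one language; node names, SSH access, and the package version pin are placeholders, and kubeadm-managed nodes are assumed:

```python
# Sketch of the per-node flow a kubeadm upgrade playbook typically encodes:
# drain the node, upgrade kubeadm/kubelet on it, uncordon it, move on.
# Node names, SSH access, and the version pin below are all placeholders.
import subprocess

TARGET = "1.29.4-1.1"  # hypothetical package version pin

def run(cmd):
    subprocess.run(cmd, check=True)

def upgrade_node(node):
    run(["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data"])
    # In Ansible this step would be a task executed on the node itself.
    run(["ssh", node,
         f"sudo apt-get install -y kubeadm={TARGET} kubelet={TARGET} "
         "&& sudo kubeadm upgrade node && sudo systemctl restart kubelet"])
    run(["kubectl", "uncordon", node])

# One node at a time, like `serial: 1` in a playbook.
for node in ["worker-1", "worker-2"]:
    upgrade_node(node)
```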
That sounds rough. Tough to improve things when the team isn’t on board.
We use cluster-api to help manage our Kubernetes offerings for customers. Thanks to that automation, updates have become almost effortless for us. We typically update our clusters every month, which only takes about half a day overall. In terms of human effort, it’s less than 10 minutes a month! The key is automation—upgrading should be as straightforward as changing a single number.
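To make "changing a single number" concrete: with ClusterClass-based clusters, an upgrade is a patch to spec.topology.version on the Cluster object in the management cluster, and Cluster API rolls the control plane and workers from there. A minimal sketch with the Python client; the cluster name, namespace, and version are placeholders:

```python
# Bump a Cluster API cluster to a new Kubernetes version by patching
# spec.topology.version; CAPI then rolls control plane and worker nodes.
# Assumes a ClusterClass-based cluster; name/namespace/version are placeholders.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the management cluster
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="clusters",
    name="prod-cluster-01",
    body={"spec": {"topology": {"version": "v1.29.4"}}},
)
```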
What kinds of defects are you able to detect automatically?
This is what I wish for! Our situation is a bit different since we have dozens of clusters in the cloud and hundreds on-prem. I'll check out the cluster-api project.
A big improvement for us came from using ArgoCD ApplicationSets pointed at our core-apps repository. Instead of updating charts in every cluster individually, we make one change per environment (dev, staging, prod) and the AppSets fan it out to all the clusters in that environment. It's kept things manageable as the cluster count grows.
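Roughly what that looks like, sketched as the ApplicationSet resource in Python dict form; the repo URL, the `env` label convention, and the chart path are illustrative assumptions, not Argo CD defaults:

```python
# Sketch of an Argo CD ApplicationSet using the cluster generator: every
# registered cluster labeled with an env gets the core-apps chart at that
# environment's branch. Repo URL, labels, and paths are placeholders.
from kubernetes import client, config

appset = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "ApplicationSet",
    "metadata": {"name": "core-apps", "namespace": "argocd"},
    "spec": {
        "generators": [
            # One Application per cluster secret carrying an 'env' label.
            {"clusters": {"selector": {"matchExpressions": [
                {"key": "env", "operator": "In",
                 "values": ["dev", "staging", "prod"]},
            ]}}}
        ],
        "template": {
            "metadata": {"name": "core-apps-{{name}}"},
            "spec": {
                "project": "default",
                "source": {
                    "repoURL": "https://git.example.com/platform/core-apps.git",
                    # One branch per environment: a single commit updates
                    # every cluster in that environment at once.
                    "targetRevision": "{{metadata.labels.env}}",
                    "path": "charts",
                },
                "destination": {"server": "{{server}}",
                                "namespace": "kube-system"},
                "syncPolicy": {"automated": {"prune": True}},
            },
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argocd",
    plural="applicationsets", body=appset,
)
```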
Honestly, a lot of the hassle comes from managing too many clusters. If user isolation is the only reason you run so many, consider switching to hosted control planes or virtual clusters: upgrading a vCluster control plane takes seconds instead of hours. Projects like Sveltos can also help manage add-on deployment across multiple clusters.
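For a flavor of the Sveltos approach: a ClusterProfile deploys add-ons (here a Helm chart) onto every cluster matching a label selector. The field names below follow the Sveltos docs as I recall them; double-check the apiVersion against your install, and the chart details are just an example:

```python
# Hedged sketch of a Sveltos ClusterProfile: install one Helm chart onto
# every managed cluster matching the label selector. Verify the apiVersion
# and field names against your Sveltos release; chart values are examples.
from kubernetes import client, config

profile = {
    "apiVersion": "config.projectsveltos.io/v1beta1",
    "kind": "ClusterProfile",
    "metadata": {"name": "cert-manager"},
    "spec": {
        "clusterSelector": {"matchLabels": {"env": "prod"}},
        "helmCharts": [{
            "repositoryURL": "https://charts.jetstack.io",
            "repositoryName": "jetstack",
            "chartName": "jetstack/cert-manager",
            "chartVersion": "v1.14.4",
            "releaseName": "cert-manager",
            "releaseNamespace": "cert-manager",
            "helmChartAction": "Install",
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_cluster_custom_object(
    group="config.projectsveltos.io", version="v1beta1",
    plural="clusterprofiles", body=profile,
)
```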
I was going to ask the same thing: why so many clusters? We only have one QA cluster with multiple environments as namespaces.

Can you elaborate on your orchestration setup? What does that flow look like?