Hey everyone! I'm part of a team managing a growing number of Kubernetes clusters (dozens, to be exact), and we've hit a significant challenge with maintenance work, particularly upgrades. It feels like an endless loop: just as we finish rolling out one version upgrade across all clusters, we're already gearing up for the next. The K8s N-2 support window is great for security, but keeping pace with it across this many clusters is relentless.
Upgrades don't just involve the Kubernetes control plane; they usually mean updating the CNI plugin, CSI drivers, ingress controllers, and more. We also run a lot of operators and controllers, like Prometheus and cert-manager, each with its own release cycle and potential breaking changes.
We operate in a hybrid environment with both managed cloud clusters and bare-metal clusters. I'm really eager to hear how other teams manage their upgrades and maintenance across many clusters. Here are some specific things I'm curious about:
1. Do you use any orchestration or automation tools for the upgrade process?
2. What criteria do you use to decide when to upgrade, and how long does it take to roll out upgrades?
3. What do your pre-flight and post-upgrade checks look like? Any tools you'd recommend?
4. How do you handle the lifecycle of all your add-ons? This has been a real headache for us.
5. How many people on your team focus on this? Is it a single person, a team, or do you rotate responsibilities?
Thanks in advance for any insights or experiences you can share!
5 Answers
Welcome to the world of SRE! For managing upgrades, we use a mix of CI/CD tooling and custom orchestration controllers. Our upgrade cadence varies by subsystem; we aim to upgrade every few months, and a complete rollout usually takes no more than two weeks. Before and after each upgrade we check that no alerts are firing, which has worked surprisingly well as a gate.
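The alert gate is simple enough to sketch. This isn't our exact tooling, just a minimal version assuming Alertmanager's v2 API; the service URL and the severity filter are placeholders:

```python
# Minimal pre-flight/post-upgrade gate: block while serious alerts are firing.
# Assumes Alertmanager's v2 HTTP API is reachable; the URL is a placeholder.
import sys
import requests

ALERTMANAGER = "http://alertmanager.monitoring.svc:9093"  # placeholder URL

def firing_alerts():
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts", params={"active": "true"}, timeout=10
    )
    resp.raise_for_status()
    # Ignore purely informational alerts; only warning/critical should block.
    return [
        a for a in resp.json()
        if a["labels"].get("severity") in ("warning", "critical")
    ]

alerts = firing_alerts()
for a in alerts:
    print(f"FIRING: {a['labels'].get('alertname')}")
if alerts:
    sys.exit(1)  # non-zero exit fails the pipeline stage
print("No blocking alerts; safe to proceed.")
```

Run it once before kicking off a rollout and again after, and let the non-zero exit fail the pipeline.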
Currently, I’m on a contract with a bank using Ansible for automation, specifically for cluster upgrades with a set of detailed playbooks. However, their heavy change control consumes a lot of my time just planning and executing upgrades, and they don’t want to improve the process. It’s draining!
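For anyone curious, the heart of those playbooks is a serial drain/upgrade/uncordon loop per node. Here's that flow sketched in Python rather than Ansible, just to keep the thread's examples in one language; node names, SSH access, and the package version pin are placeholders, and kubeadm-managed nodes are assumed:

```python
# Sketch of the per-node flow a kubeadm upgrade playbook typically encodes:
# drain the node, upgrade kubeadm/kubelet on it, uncordon it, move on.
# Node names, SSH access, and the version pin below are all placeholders.
import subprocess

TARGET = "1.29.4-1.1"  # hypothetical package version pin

def run(cmd):
    subprocess.run(cmd, check=True)

def upgrade_node(node):
    run(["kubectl", "drain", node,
         "--ignore-daemonsets", "--delete-emptydir-data"])
    # In Ansible this step would be a task executed on the node itself.
    run(["ssh", node,
         f"sudo apt-get install -y kubeadm={TARGET} kubelet={TARGET} "
         "&& sudo kubeadm upgrade node && sudo systemctl restart kubelet"])
    run(["kubectl", "uncordon", node])

# One node at a time, like `serial: 1` in a playbook.
for node in ["worker-1", "worker-2"]:
    upgrade_node(node)
```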
That sounds rough. Tough to improve things when the team isn’t on board.
We use cluster-api to help manage our Kubernetes offerings for customers. Thanks to that automation, updates have become almost effortless for us. We typically update our clusters every month, which only takes about half a day overall. In terms of human effort, it’s less than 10 minutes a month! The key is automation—upgrading should be as straightforward as changing a single number.
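To make "changing a single number" concrete: with ClusterClass-based clusters, an upgrade is a patch to spec.topology.version on the Cluster object in the management cluster, and Cluster API rolls the control plane and workers from there. A minimal sketch with the Python client; the cluster name, namespace, and version are placeholders:

```python
# Bump a Cluster API cluster to a new Kubernetes version by patching
# spec.topology.version; CAPI then rolls control plane and worker nodes.
# Assumes a ClusterClass-based cluster; name/namespace/version are placeholders.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the management cluster
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="cluster.x-k8s.io",
    version="v1beta1",
    namespace="default",
    plural="clusters",
    name="prod-cluster-01",
    body={"spec": {"topology": {"version": "v1.29.4"}}},
)
```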
What kinds of defects are you able to detect automatically?
This is what I wish for! Our situation is a bit different since we have dozens of clusters in the cloud and hundreds on-prem. I'll check out the cluster-api project.
A big improvement for us came from using ArgoCD ApplicationSets pointed at our core-apps repository. Instead of updating charts in every cluster individually, we make one change per environment (dev, staging, prod) and the AppSets fan it out to all the clusters in that environment. It's kept things manageable as the cluster count grows.
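Roughly what that looks like, sketched as the ApplicationSet resource in Python dict form; the repo URL, the `env` label convention, and the chart path are illustrative assumptions, not Argo CD defaults:

```python
# Sketch of an Argo CD ApplicationSet using the cluster generator: every
# registered cluster labeled with an env gets the core-apps chart at that
# environment's branch. Repo URL, labels, and paths are placeholders.
from kubernetes import client, config

appset = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "ApplicationSet",
    "metadata": {"name": "core-apps", "namespace": "argocd"},
    "spec": {
        "generators": [
            # One Application per cluster secret carrying an 'env' label.
            {"clusters": {"selector": {"matchExpressions": [
                {"key": "env", "operator": "In",
                 "values": ["dev", "staging", "prod"]},
            ]}}}
        ],
        "template": {
            "metadata": {"name": "core-apps-{{name}}"},
            "spec": {
                "project": "default",
                "source": {
                    "repoURL": "https://git.example.com/platform/core-apps.git",
                    # One branch per environment: a single commit updates
                    # every cluster in that environment at once.
                    "targetRevision": "{{metadata.labels.env}}",
                    "path": "charts",
                },
                "destination": {"server": "{{server}}",
                                "namespace": "kube-system"},
                "syncPolicy": {"automated": {"prune": True}},
            },
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argocd",
    plural="applicationsets", body=appset,
)
```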
Honestly, a lot of the hassle comes from managing too many clusters. If user isolation is the only reason you run so many, consider switching to hosted control planes or virtual clusters: upgrading a vCluster control plane takes seconds instead of hours. Projects like Sveltos can also help manage add-on deployment across multiple clusters.
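For a flavor of the Sveltos approach: a ClusterProfile deploys add-ons (here a Helm chart) onto every cluster matching a label selector. The field names below follow the Sveltos docs as I recall them; double-check the apiVersion against your install, and the chart details are just an example:

```python
# Hedged sketch of a Sveltos ClusterProfile: install one Helm chart onto
# every managed cluster matching the label selector. Verify the apiVersion
# and field names against your Sveltos release; chart values are examples.
from kubernetes import client, config

profile = {
    "apiVersion": "config.projectsveltos.io/v1beta1",
    "kind": "ClusterProfile",
    "metadata": {"name": "cert-manager"},
    "spec": {
        "clusterSelector": {"matchLabels": {"env": "prod"}},
        "helmCharts": [{
            "repositoryURL": "https://charts.jetstack.io",
            "repositoryName": "jetstack",
            "chartName": "jetstack/cert-manager",
            "chartVersion": "v1.14.4",
            "releaseName": "cert-manager",
            "releaseNamespace": "cert-manager",
            "helmChartAction": "Install",
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_cluster_custom_object(
    group="config.projectsveltos.io", version="v1beta1",
    plural="clusterprofiles", body=profile,
)
```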
I was going to ask the same thing: why so many clusters? We only have one QA cluster with multiple environments as namespaces.

Can you elaborate on your orchestration setup? What does that flow look like?