Hey folks,
I'd like to share a situation we faced with our RKE2 cluster setup that uses Istio and Canal for networking. We initially ran with 6 Control Plane (CP) nodes. Here's the timeline of events:
**The Incident:** We ended up losing 3 of those CP nodes at the same time. I expected the data plane to keep working, but the result was a complete outage. It wasn't just the API that went down: our applications started failing, traffic stopped being routed, and we got bombarded with `503` errors.
I'm hoping to understand what might have caused this issue. Any insights?
5 Answers
Your CNI orchestration workloads are probably hitting the Kubernetes API for resource updates constantly. When the API fails, that interrupts everything, including pod networking and services like kube-dns. I haven't used Canal personally, but I imagine it behaves similarly. Your best bet is to ensure that there’s a reliable way to access the API even when nodes are failing.
Didn’t you set up observability for your control planes? That would help you track these kinds of issues. It’s odd to not have insight during an outage like this, right?
Imagine the management approving all those resources for a high-availability cluster, and then this outage happens! It’s funny how we tend to think splitting nodes across different data centers makes us immune to failure, but clearly, that’s not the case.
The thing is, having 6 control nodes is risky, especially after people warned you about it. An odd number is recommended because going from 5 to 6 raises the quorum size without letting you survive any additional failures. Also, what really caused you to lose 3 control planes? Was it a network partition, server crashes, or something with etcd?
It seems like your control plane lost quorum. etcd needs a strict majority of its members to stay available, so with 6 CP nodes the quorum is 4, and losing 3 leaves you one short. An even number of CP nodes, like 6, is bad practice because it tolerates only 2 failures, the same as 5, while adding one more node that can fail. I’d suggest reducing your CP nodes to 5: you keep the same fault tolerance with less overhead. When the cluster disruption is that extreme, it's pretty much a given that other components will start failing too, especially since the API server isn’t responsive anymore.
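To put numbers on the quorum point, here's a minimal sketch of the majority arithmetic. It assumes each CP node runs exactly one voting etcd member, which is an assumption about this cluster rather than something stated in the post, and the function names are purely illustrative.

```python
def quorum(members: int) -> int:
    """Smallest strict majority for a cluster with `members` voting members."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


# Compare a few cluster sizes, including the 6-node setup from the question.
for n in (3, 5, 6, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failures")

# The incident: 6 members, 3 lost at once (assumed one etcd member per CP node).
survivors = 6 - 3
print("quorum kept" if survivors >= quorum(6) else "quorum lost")
```

Under that assumption, 5 and 6 members both tolerate 2 failures, and 3 survivors out of 6 fall below the quorum of 4, which matches what happened here.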

Yeah, I have observability in place, but my concern was more about the data plane issues rather than control plane metrics.