Hey folks! We recently had a serious issue in our Kubernetes setup that's left us scratching our heads. We're running a somewhat odd configuration with 6 control plane nodes (not ideal, I know). Our storage solution is Longhorn, and we have various stateful apps running, including Vault, Loki, and Prometheus.
Here's the situation: three of our master nodes went down simultaneously, which rendered the whole cluster non-functional for a bit. They rebooted about 5-10 minutes later, and everything eventually came back online.
After investigating, we found that the kube-apiserver process was OOM-killed on the affected nodes due to high RAM usage. We also found kernel logs showing significant disk and I/O errors, and an iostat check showed very high I/O utilization.
We suspect Vault could be the culprit since it's running on the master nodes, which is usually not recommended. But curiously, the nodes that failed were not the same ones hosting the Vault pods. Given that this odd setup had been functioning okay until now, we're stumped.
Could Longhorn's heavy lifting (like replication or snapshotting) have triggered an I/O storm that caused the kube-apiserver to balloon in memory and get killed? Or could etcd struggling under high I/O have led to this cascading failure? Has anyone here seen a similar scenario?
4 Answers
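First off, 6 control plane nodes isn't common practice. Assuming etcd runs on those same nodes, it needs a majority of members (quorum) to operate, so an even member count adds no fault tolerance over the next lower odd count: with 6 members the quorum is 4, so you can lose only 2, the same as with 5. Losing 3 of 6 at once costs you quorum, which matches the whole cluster becoming non-functional. It's also generally best to avoid running workloads on those nodes unless absolutely necessary. If one service hogs resources, you can end up in exactly the situation you described, where critical components like the kube-apiserver get OOM-killed.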
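To make the quorum point concrete, here's a quick sketch in Python. It only does the majority arithmetic that etcd's Raft consensus relies on; the function names are mine, not anything from etcd or Kubernetes.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("members must be >= 1")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (3, 5, 6, 7):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {fault_tolerance(n)} failures")
```

With 6 members this gives a quorum of 4 and a tolerance of 2, the same tolerance as 5 members, so the sixth node adds another thing that can fail without adding resilience.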
I think high I/O could definitely be a factor. When disks can't keep up with writes, pending work tends to pile up in memory, so an I/O spike and a memory spike often show up together.
Totally agree! High I/O might be behind your kube-apiserver's memory issues, especially if it was processing large requests. Was there anything unusual about the types of requests the API servers were handling? Also, is swap enabled? And if audit logging is enabled, it can significantly increase disk writes, which might explain the I/O spike.
+1 to that! Auditing can really cause disk writes to skyrocket.
And don't forget to check how close to the memory limit your servers usually run. Knowing this can clarify a lot.
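If you want a quick way to eyeball memory headroom and whether swap is on for a given node, something like this rough Python sketch works on Linux. It only reads the standard /proc/meminfo fields (MemTotal, MemAvailable, SwapTotal); the function names are made up for illustration, and for the "usual" level over time you'd still want your monitoring history (e.g. Prometheus) rather than a one-off reading.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields: dict) -> dict:
    """Fraction of RAM in use (based on MemAvailable) and swap status."""
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return {
        "used_fraction": (total - available) / total,
        "swap_enabled": fields.get("SwapTotal", 0) > 0,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        summary = memory_summary(parse_meminfo(f.read()))
    print(f"memory used: {summary['used_fraction']:.0%}, "
          f"swap enabled: {summary['swap_enabled']}")
```

Run it on each control plane node; if the nodes normally sit close to full, it doesn't take much of a spike for the OOM killer to step in.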
It’s also worth considering whether any services in your cluster might be sending excessive requests to the API server, almost treating it like a database. Large custom resources (defined via CRDs) or a very high object count can create a query storm that overwhelms etcd and the API server, which could balloon their memory use and lead to OOM kills.
Right? Services like Trivy are notorious for doing this.
I'm curious about your underlying storage. I've seen SAN issues lead to similar problems, especially with etcd being sensitive to any disk interruptions, which can crash system pods. What kind of storage setup do you have? That might be a crucial aspect to look into!
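For a rough feel of how the disk under etcd behaves, here's a small Python sketch that times small write-plus-fsync cycles in a directory you pick. This is just a probe I'd use for a first look, not a substitute for a proper fio run or for etcd's own etcd_disk_wal_fsync_duration_seconds metric. The function names and the sample count and write size are arbitrary choices of mine, and etcd itself uses fdatasync rather than fsync, so treat the numbers as indicative only. The commonly cited guidance from the etcd docs is that the 99th percentile of WAL fsync latency should stay under about 10 ms.

```python
import math
import os
import sys
import tempfile
import time


def fsync_latencies(directory: str, samples: int = 200, size: int = 2048) -> list:
    """Write `size` bytes then fsync, `samples` times; return latencies in seconds."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)  # clean up the scratch file
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of `values` (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    results = fsync_latencies(target)
    print(f"p50={percentile(results, 50) * 1000:.2f} ms, "
          f"p99={percentile(results, 99) * 1000:.2f} ms")
```

Point it at a directory on the same disk as the etcd data directory, ideally while Longhorn is busy replicating or snapshotting, and compare with a quiet period.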

Exactly! It's crucial to keep core services insulated from potential resource starvation.