I'm curious about the real-world problems teams encounter while managing large numbers of Kubernetes clusters. What are the common pain points that come up?
6 Answers
Setting resource requests and limits is a big deal for us, along with managing local disks and network-backed PVCs and keeping everything up to date. Those are our main challenges.
One major issue is resource management. I've dealt with clusters where pods kept getting OOMKilled because developers didn't set appropriate memory limits. Deploying with the 'latest' tag is also risky; it's better to pin your versions. Network policies often get overlooked too, which can lead to further complications.
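To make the missing-limits problem concrete, here is a minimal sketch of a check you could run in CI. It assumes a Deployment-style manifest already loaded as a dict (for example from `kubectl get deployment -o json`); the function name and the sample manifest are hypothetical, not from any particular tool.

```python
import json


def containers_missing_memory_limit(manifest: dict) -> list:
    """Return names of containers that declare no memory limit.

    Assumes a Deployment-style layout: spec.template.spec.containers.
    """
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        # "resources" or "limits" may be absent or null, so fall back to {}.
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        if "memory" not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest: "api" sets a memory limit, "sidecar" does not.
deployment = json.loads("""
{
  "kind": "Deployment",
  "spec": {"template": {"spec": {"containers": [
    {"name": "api", "image": "example/api:1.4.2",
     "resources": {"requests": {"memory": "256Mi"},
                   "limits": {"memory": "512Mi"}}},
    {"name": "sidecar", "image": "example/sidecar:0.9.0"}
  ]}}}
}
""")

print(containers_missing_memory_limit(deployment))  # ['sidecar']
```

This only flags containers with no limit at all; whether a limit that is set is actually appropriate still takes real usage data.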
For sure! And when it comes to pinning versions, go for the digest. Tags are mutable and can be repointed to a different image, which is why 'latest' ends up being dangerous.
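If you want to enforce digest pinning, a rough sketch of a check might look like the following. It assumes `sha256` digests only (other algorithms are possible but less common), and the helper names and image references are made up for illustration.

```python
import re

# A digest-pinned reference ends with "@sha256:" plus 64 lowercase hex chars.
# Assumption: only sha256 digests are in use.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True if the image reference is pinned by digest rather than by tag."""
    return bool(_DIGEST_RE.search(image))


def unpinned_images(images):
    """Return the image references that are not pinned by digest."""
    return [img for img in images if not is_digest_pinned(img)]


# Hypothetical image references; the digest is a placeholder, not a real image.
images = [
    "nginx:latest",
    "nginx:1.27",
    "registry.example.com/app@sha256:" + "a" * 64,
]

print(unpinned_images(images))  # ['nginx:latest', 'nginx:1.27']
```

Note that a versioned tag like `nginx:1.27` is still flagged here, since any tag can be moved; only the digest identifies one exact image.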
Resource allocation is tricky. We also struggled with handling huge traffic spikes, like going from 50 rps to 400k rps. We found a tool called Thoras.ai that predicts traffic effectively. Just sharing, not affiliated at all!
Using the latest AWS AMI versions has caused outages for us. Now we hardcode versions and test new ones in non-production environments before deploying them. Cluster updates can be a hassle too, but if you're using Infrastructure as Code (IaC), you can just loop through your Terraform applies. In AWS, Karpenter helps automate worker-level resource allocation, but planning node pools carefully is still essential. Overseeing application-specific resource requests and limits is crucial: if teams don't manage them well, they waste resources. We also set up notifications during deployments for better visibility.
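For anyone wondering what "loop through your Terraform applies" could look like, here is a minimal sketch. It assumes one Terraform root directory per cluster and uses Terraform's `-chdir` global option with `apply -auto-approve`; the directory names and function names are hypothetical, and a real setup would likely add a plan/review step.

```python
import subprocess


def build_apply_command(cluster_dir: str) -> list:
    """Build the argv for applying one cluster's Terraform directory."""
    return ["terraform", f"-chdir={cluster_dir}", "apply", "-auto-approve"]


def apply_all(cluster_dirs, run=subprocess.run) -> list:
    """Apply each directory in order and return the ones that were applied.

    With the default runner, check=True raises CalledProcessError on the
    first failing apply, so later clusters are left untouched.
    """
    applied = []
    for cluster_dir in cluster_dirs:
        run(build_apply_command(cluster_dir), check=True)
        applied.append(cluster_dir)
    return applied


# Example usage (hypothetical paths), non-production first:
#   apply_all(["clusters/staging", "clusters/prod"])
```

Ordering the list so that non-production clusters go first means a bad change fails before it reaches production. The `run` parameter is only there so the loop can be exercised without invoking Terraform.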
Resource management is a headache, and so is keeping on-premises nodes current with new k8s versions and kernel upgrades. Getting teams to avoid building monolithic setups out of microservices is more of a cultural issue, but still a struggle.
Are you on cloud or bare metal? Bare metal is definitely tougher—it requires careful monitoring of control planes and core API services, on top of everything else!
Absolutely! And don’t forget about persistent storage on bare metal—it can be tricky. NFS can work, but it brings its own challenges.