I'm looking to gather some insights about real-world experiences with Kubernetes in production environments. For those out there using Kubernetes, what are some actual security issues you've encountered? Have you faced any observability gaps that caused significant problems? What types of incidents have you experienced in live settings? I'm particularly interested in practical failures rather than just theoretical best practices. Additionally, which open-source tools have proven to be the most valuable in addressing these challenges? I appreciate any real-life examples you can share to help us learn from your experiences!
5 Answers
For us, it’s been mostly smooth sailing; solid testing practices catch issues early. But we've had our share of mishaps, like DDoSing our internal container registry during node rollouts and tools fighting over each other's iptables rules. Observability for network metrics has been tough, and one time a namespace with production apps in it got deleted! Overall, there's a lot to manage with constant upgrades and tool churn, which adds stress to the control plane, but running replicated apps mitigates the impact significantly.
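If the rollout stampede bites you too, the kubelet can throttle image pulls per node. A minimal KubeletConfiguration sketch (these are standard kubelet fields, but the values are illustrative, not tuned recommendations):

```yaml
# KubeletConfiguration sketch: throttle per-node image pulls so that a
# rolling node replacement doesn't stampede the registry.
# The values here are illustrative, not tuned recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true  # pull one image at a time per node
registryPullQPS: 2         # cap registry pull requests per second
registryBurst: 5           # allow short bursts above the QPS cap
```

Pairing this with a registry-side cache helps even more.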
One major issue I've faced was sizing the cluster subnet too small for Kubernetes. We ran out of IP addresses and had to migrate to a larger CIDR range, which meant a ton of new firewall requests. It’s something to get right at the design stage!
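For anyone designing a new cluster: the pod CIDR is fixed at creation time, so it pays to oversize it. A kubeadm sketch (assuming a kubeadm-managed cluster; the ranges are illustrative):

```yaml
# kubeadm ClusterConfiguration sketch: size the pod and service CIDRs
# generously up front, since resizing later means re-IPing the cluster.
# The ranges below are illustrative assumptions, not recommendations.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 10.128.0.0/14     # room for ~1000 nodes at a /24 each
  serviceSubnet: 10.96.0.0/12  # the kubeadm default service range
```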
We’re facing this too. Just had to upsize our subnet, not fun at all.
Not using IPv6 was a massive oversight; that was my first big mistake!
DockerHub rate limits hit us hard, causing major delays. We started self-hosting a Docker registry and using it as a pull-through cache, which has worked out really well in terms of performance and reliability. We also tried Harbor, which looks promising!
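For anyone setting this up, the open-source Distribution registry supports proxy mode out of the box. A minimal config.yml sketch (credentials and the storage path are placeholders; supply your own):

```yaml
# config.yml sketch for the open-source Distribution registry acting as
# a pull-through cache in front of Docker Hub.
version: 0.1
http:
  addr: :5000
storage:
  filesystem:
    rootdirectory: /var/lib/registry
proxy:
  remoteurl: https://registry-1.docker.io
  username: mydockerhubuser      # hypothetical account; raises the rate limit
  password: mydockerhubpassword  # keep this in a secret store in practice
```

Point containerd or Docker at it as a registry mirror and the Hub rate limits mostly stop being your problem.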
Using AWS ECR as a pull-through cache is another great option if you’re on AWS.
Honestly, self-hosting has saved us a lot of headaches!
One of the biggest problems we dealt with was RBAC misconfigurations: a leaked service account token allowed a pod to access far more than it should have. We also struggled to debug latency issues for lack of tracing until we rolled out Prometheus, Grafana, and Jaeger. Certificate expirations and NetworkPolicy misconfigurations wreaked havoc on our systems too. Definitely things not to overlook!
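To make those concrete, here is a sketch of the two guardrails that would have limited our blast radius (the namespace, names, and service account are hypothetical):

```yaml
# A namespace-scoped Role, so a leaked pod token can only read pods in
# its own namespace instead of inheriting a broad ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-app
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: my-app
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: my-app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
# A default-deny NetworkPolicy, so traffic has to be allowed explicitly
# rather than flowing by accident.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
```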
That’s a huge pain! Misconfigurations can definitely lead to a lot of chaos.
Right? We had a similar situation with logs not going to CloudWatch. Such a headache!
I remember one time I accidentally added around 60 machines to the apiserver pool instead of the node pool. Let's just say etcd wasn't happy and went down. I learned two key lessons: workloads keep chugging along even if the control plane is down for a while, and you can restore etcd from its data directory even when cluster membership is broken; just shut down all but one apiserver first!
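For the curious, on a kubeadm-style cluster with stacked etcd, the recovery lever is etcd's --force-new-cluster flag, which discards the stale membership state and boots a single-member cluster from the existing data directory. An abridged static pod sketch (the image tag, paths, and everything apart from the flag itself are illustrative):

```yaml
# /etc/kubernetes/manifests/etcd.yaml (abridged sketch)
# Stop the extra apiservers/control-plane nodes first, then add
# --force-new-cluster so etcd rebuilds a one-member cluster from its
# existing data directory. Remove the flag after the first clean start.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.9-0   # illustrative version tag
    command:
    - etcd
    - --data-dir=/var/lib/etcd
    - --force-new-cluster                 # drops stale membership state
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
```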
Thanks for the insight! You've really paved the way for us rookie admins.
I’ve also seen surprising uptime when the apiserver is messed up! How does that even work?
Was your registry self-hosted or in the cloud? Just wondering how to handle that.