Hey everyone! I'm really interested in hearing about your experiences using Kubernetes in production environments. Specifically, I'd love to learn about the security issues you've faced, any observability gaps that have caused headaches, and notable failures you've encountered. I'm looking for practical examples rather than just a list of best practices. Also, which open-source tools have you found most helpful in addressing these challenges, whether they're related to security, logging, tracing, monitoring, or policy enforcement? Thanks for sharing your insights!
5 Answers
There have definitely been ups and downs with our Kubernetes experience. One of the biggest challenges was our internal container registry effectively getting DDoSed during a rollout, which made pulling images impossible. We also had a namespace with production workloads in it get deleted by accident, yikes! Most of the time we manage well, but you always need to stay on high alert for those sudden failures.
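On the accidental namespace deletion: one guardrail is an admission policy that rejects deletes of labeled namespaces. A minimal sketch, assuming Kyverno is installed and that protected namespaces carry a hypothetical `protected: "true"` label:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-protected-namespace-deletes
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: deny-delete
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  protected: "true"
      preconditions:
        any:
          # Only fire on DELETE requests to the API server
          - key: "{{ request.operation }}"
            operator: Equals
            value: DELETE
      validate:
        message: "Namespaces labeled protected=true cannot be deleted."
        deny: {}
```

Anyone who genuinely needs to delete the namespace has to remove the label first, which turns a one-keystroke disaster into a deliberate two-step action.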
We ran into some major issues with RBAC misconfigurations and overly permissive service accounts. Just one leaked token and a pod could access way more than it should have! On the observability side, before we set up Prometheus and Grafana, debugging latency issues across services without tracing was an absolute nightmare. It's crazy how much cert expirations and misconfigured policies can throw a wrench in the works!
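On the overly permissive service account point, a least-privilege setup usually pairs a narrowly scoped Role with a service account that doesn't automount its token. A sketch with hypothetical names (`app-sa`, `prod` namespace, ConfigMap read access):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: prod
# Pods must opt in to mounting the token explicitly
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-sa-configmap-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: prod
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```

With this shape, a leaked token is limited to reading ConfigMaps in a single namespace instead of roaming the cluster.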
Wow, that sounds like quite a mess!
I once mistakenly added around 60 machines to the apiserver pool instead of the node pool, and let me tell you, etcd was furious! It collapsed under the load. I learned two things from that experience: workloads keep running in their last state even if the control plane goes down, and etcd data can be recovered without the old membership once the extra apiservers are shut down. Quite the adventure!
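For anyone curious what that recovery roughly looks like: shut down the surplus apiservers, then rebuild etcd as a fresh single-member cluster from a snapshot, which discards the old membership. A sketch with illustrative paths and endpoints, assuming `etcdctl` v3 and the usual kubeadm certificate locations:

```shell
# Take a snapshot of the (still readable) etcd data
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore into a new data dir as a single-member cluster;
# this drops the old membership entirely
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
  --data-dir=/var/lib/etcd-restored \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://127.0.0.1:2380 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380
```

Point etcd at the restored data dir afterwards; the runaway apiservers must stay down until the new cluster is healthy.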
Thanks for sharing! Sometimes we just have to learn the hard way.
Isn’t it wild how some things keep working despite the chaos?
We've had a lot of trouble with Docker Hub rate limits disrupting our workflow at crucial times. To tackle this, we're self-hosting a Docker registry as a pull-through cache, which I think is a solid solution!
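For reference, the open-source Distribution registry supports pull-through caching with a small config addition. A minimal sketch of its `config.yml` (storage path and port are illustrative):

```yaml
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  # Mirror Docker Hub; cached layers are served locally on later pulls
  remoteurl: https://registry-1.docker.io
```

With `proxy.remoteurl` set, a Hub outage or rate limit no longer blocks pulls of images the cache has already seen.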
I recently set up Harbor for this problem, and it's been fantastic!
I migrated my customers to ECR public to avoid those limits entirely. No rate limiting there!
One time, we accidentally sized our Kubernetes subnet too small. We ran out of IP addresses and had to expand the subnet, which turned into a headache with tons of firewall requests!
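The sizing math is worth doing up front: with the common /24-per-node pod allocation, the cluster pod CIDR caps your node count, since a /16 pod network only yields 2^(24-16) = 256 node-sized /24 blocks. A kubeadm sketch (addresses are illustrative):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  # /16 pod network -> 256 /24 node blocks; a /20 would cap you at 16 nodes
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
```

Changing these after the fact is painful, which is exactly how the firewall-request headache starts.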
We're dealing with that right now too.
Not using IPv6 can definitely complicate this issue.

Was your registry self-hosted or managed by a cloud provider?