Hey everyone! I'm looking to gather insights on the tools and methods your teams use to automatically resolve common Kubernetes problems. Specifically, I'm interested in issues like OOMKilled pods, CrashLoopBackOff workloads, disk pressure on PVCs, automated node drain and reboot, and HPA scaling saturation. We've experimented with a few solutions, but I'd love to hear about any proofs of concept or configurations that have worked well for you in production. What frameworks, scripts, or tools do you recommend for handling these situations? I'm just trying to save the 5-15 minutes we typically spend on each of these issues every time one comes up.
3 Answers
I think there are limits to automation. For OOMKilled pods, sure, we could automatically raise the memory limit, but that defeats the purpose of setting resource limits in the first place. Developers should ideally address the root cause. The same goes for CrashLoopBackOff: it's better to have devs look at the code errors than to rely on automation to paper over them. Disk pressure is different, though. Scaling up the volume could be automated, if you need to go that route.
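To make that split concrete, here is a minimal sketch of a triage policy that automates only the disk case and routes the other two to people. Everything in it is an assumption for illustration: the symptom strings, the action names, and the `triage` function are hypothetical and not part of any Kubernetes API or tool.

```python
from dataclasses import dataclass

# Hypothetical policy table: symptoms considered safe to fix automatically.
# The keys are plain strings chosen for this sketch, not an official enum.
AUTO_REMEDIATE = {"DiskPressure": "expand_volume"}

# Symptoms that usually point at an application bug and should go to the devs.
ESCALATE_TO_DEVS = {"OOMKilled", "CrashLoopBackOff"}


@dataclass(frozen=True)
class Decision:
    action: str       # what the (hypothetical) remediation bot should do
    automated: bool   # True if no human is needed


def triage(reason: str) -> Decision:
    """Map a symptom to an action: automate, hand to devs, or page on-call."""
    if reason in AUTO_REMEDIATE:
        return Decision(AUTO_REMEDIATE[reason], True)
    if reason in ESCALATE_TO_DEVS:
        return Decision("notify_owning_team", False)
    # Anything unknown is treated as unsafe to automate.
    return Decision("page_on_call", False)
```

The point of the design is that the default is "don't automate": a symptom has to be explicitly listed before the bot is allowed to touch it.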
For me, the key methods are thorough load testing and preemptive alerts in staging. Cluster API with its alpha rollout features has been helpful as well. Keep load testing against sensible resource limits on an ongoing basis, not just once, so regressions show up before production.
For automated node drain and reboot, tools like Cluster API and Karpenter are fantastic. They handle draining out of the box. For the other issues you've mentioned, though, fixing the applications should really be the priority. Focus on the core problems first!

I totally get that! Not everything should be automated. But auto-remediation for known low-risk fixes like PVC resizing can definitely free up engineers' time for more complex issues. It's about finding that balance!
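For what it's worth, the "low-risk" part mostly comes down to guardrails. Below is a rough sketch of the sizing decision only, assuming usage and capacity are already available in GiB. The function name, the 85% threshold, the 25% growth step, and the 500 GiB cap are all made-up values for illustration. Actually applying the new size is out of scope here, and it only works if the StorageClass has `allowVolumeExpansion: true` (and PVCs can only grow, never shrink).

```python
import math
from typing import Optional


def next_pvc_size_gib(
    capacity_gib: int,
    used_gib: float,
    threshold: float = 0.85,  # assumed: resize once usage reaches 85%
    growth: float = 1.25,     # assumed: grow by 25% per step
    max_gib: int = 500,       # assumed: hard cap, beyond this a human decides
) -> Optional[int]:
    """Return the new requested size in GiB, or None if no resize should happen."""
    if capacity_gib <= 0:
        raise ValueError("capacity_gib must be positive")

    # Below the threshold there is nothing to do.
    if used_gib / capacity_gib < threshold:
        return None

    # Grow by the configured factor, rounded up, but never past the cap.
    proposed = min(math.ceil(capacity_gib * growth), max_gib)

    # If the cap prevents any growth, return None so the caller can escalate.
    return proposed if proposed > capacity_gib else None
```

A caller would treat `None` above the threshold as "page someone", which keeps the automation from silently growing a volume forever when the real problem is a runaway log or a leak.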