I've been running into issues with Kubernetes pods getting stuck in the terminating state when nodes or zones fail. I want to know the best practices to handle this, especially if we can automate the resolution process instead of manually intervening every time.
4 Answers
One solution I've found works well is lowering the tolerationSeconds on the node taint tolerations that control eviction (node.kubernetes.io/not-ready and node.kubernetes.io/unreachable). The default is 300 seconds; setting it to something like 10 seconds means pods on a dead node are evicted and rescheduled much more quickly. Keep in mind that this works for Deployments but not StatefulSets: because each StatefulSet pod has a stable name, the controller won't create a replacement until the old pod is confirmed deleted, and that can't happen while the node is unreachable. I've had to implement a controller that force-deletes StatefulSet pods when this happens, and it covers most cases.
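To make the toleration setting above concrete, here is a minimal sketch in Python that builds the two toleration entries and wraps them in a patch body for a Deployment. The taint keys and the tolerationSeconds field are standard Kubernetes; the function names and the 10-second value are only illustrative assumptions, so tune the number to how often your nodes flap.

```python
import json


def fast_eviction_tolerations(seconds=10):
    """Build tolerations that let a pod stay on a not-ready or
    unreachable node for only `seconds` before it is evicted.
    (Function name and default are illustrative, not a Kubernetes API.)"""
    keys = ("node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable")
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in keys
    ]


def deployment_patch(seconds=10):
    """Wrap the tolerations in a patch body for a Deployment's pod template.
    Note: with a JSON merge patch, this replaces any existing tolerations list."""
    return {
        "spec": {
            "template": {
                "spec": {"tolerations": fast_eviction_tolerations(seconds)}
            }
        }
    }


if __name__ == "__main__":
    # Print the patch so it can be reviewed or piped into other tooling.
    print(json.dumps(deployment_patch(10), indent=2))
```

The printed JSON could then be applied with something like kubectl patch deployment <name> --type merge -p '<json>'. Changing the pod template triggers a rollout, so try it on a non-critical workload first.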
Pods rarely get stuck in Terminating without a reason. Usually it's a finalizer that never completes, a volume that can't be unmounted or detached, or a preStop hook that hangs. To prevent this, keep terminationGracePeriodSeconds reasonable and make sure your cleanup logic actually finishes. If a node is dead and not coming back, it's good practice to cordon it and delete the Node object so Kubernetes can clean up the pods that were bound to it. We've had success using CubeAPM to spot these issues before they pile up; it's all about having visibility!
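As a rough illustration of how you might triage this automatically, the sketch below looks at a pod object (for example, the output of kubectl get pod <name> -o json loaded as a dict) and reports whether it looks stuck and whether finalizers are the likely cause. The metadata.deletionTimestamp and metadata.finalizers fields are standard; the function names, the slack_seconds margin, and the assumption that deletionTimestamp marks the end of the grace period are mine, so treat the output as a hint rather than a verdict.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts):
    """Parse the UTC timestamp format used in pod metadata, e.g. 2024-01-01T12:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def diagnose_terminating_pod(pod, now, slack_seconds=30):
    """Classify a pod dict as not terminating, still within its grace
    period, or stuck. `slack_seconds` is an arbitrary safety margin."""
    meta = pod.get("metadata", {})
    deleted_at = meta.get("deletionTimestamp")
    if deleted_at is None:
        return "not terminating"

    # Assumption: deletionTimestamp is the deadline by which the pod should be gone.
    deadline = parse_rfc3339(deleted_at) + timedelta(seconds=slack_seconds)
    if now <= deadline:
        return "terminating within grace period"

    finalizers = meta.get("finalizers") or []
    if finalizers:
        return "stuck: finalizers pending: " + ", ".join(finalizers)
    return "stuck: no finalizers; check node status, volumes, preStop hook"


if __name__ == "__main__":
    # Hypothetical pod with a made-up finalizer name, for demonstration only.
    example = {
        "metadata": {
            "deletionTimestamp": "2024-01-01T12:00:00Z",
            "finalizers": ["example.com/cleanup"],
        }
    }
    print(diagnose_terminating_pod(example, datetime.now(timezone.utc)))
```

A check like this could run on a schedule and raise an alert or feed a cleanup controller. It deliberately does not delete anything, since force-deleting a pod whose container may still be running can be unsafe for stateful workloads.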
Honestly, Kubernetes could do a much better job of showing when finalizers are blocking resource deletion. It's frustrating when your pods get stuck, and you're left scratching your head. A dedicated event message for these situations would definitely help!
I hear you on that! Manual intervention is often required, especially if you're dealing with finalizers or PVC issues. For example, when a network issue cuts the node off from the API server, the kubelet can't confirm that the containers have stopped, so the pods stay in Terminating. It's crucial to find out what's actually blocking the deletion before deciding how to handle it.
