I'm using NodeProblemDetector (NPD) to monitor our nodes, but it failed to notice that containerd was non-functional for several hours. We saw containers stuck in the kernel D state (uninterruptible sleep), where SIGKILL has no effect, and shims accumulating because 'StopContainer' calls exceeded their deadline. Because NPD never detected that containerd was unresponsive, the node was never given an unhealthy condition. How can I make sure a non-functional containerd is detected in the future so that the node status gets updated correctly?
4 Answers
During this scenario, was the node condition marked as Unhealthy at any point? If you're running in a cloud environment, node pools typically have self-healing mechanisms. If you're not using those, it might help to look at metrics, such as what the cluster autoscaler reports, to understand what's going on.
I personally rely on Kuberhealthy for monitoring. It runs various checks like daemonset, deployment, and DNS checks, and I have alerts set up with Prometheus for pods that are stuck during startup. Just keep in mind that it can fill up your Kubernetes event logs since it actively monitors a lot. But I'd rather catch an issue with a test container instead of letting a production app get stuck.
It sounds like you might have some stuck processes on your hands. One thing you could do is set up monitoring to check for these stuck processes. You can run a command like `ps -eo state,pid,comm | grep '^D'` to get that information and set alerts on it. However, I'd definitely dive into figuring out why this is happening in the first place, particularly around storage monitoring.
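As a rough sketch of how that `ps` check could feed an alert, the snippet below parses the same `ps -eo state,pid,comm` output and returns the processes in D state. The function names are made up for illustration, and what you do with the result (export a metric, page someone) is left open.

```python
import subprocess


def parse_d_state(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, comm) for each line whose state column starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # state, pid, rest of line as comm
        if len(parts) < 3 or not parts[0].startswith("D"):
            continue
        if not parts[1].isdigit():  # skips anything that is not a data row
            continue
        stuck.append((int(parts[1]), parts[2].strip()))
    return stuck


def list_d_state() -> list[tuple[int, str]]:
    """Run ps on the local host and return its D-state processes."""
    out = subprocess.run(
        ["ps", "-eo", "state,pid,comm"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout
    return parse_d_state(out)
```

A single sample can catch processes that are only briefly in D state during normal I/O, so it is probably worth alerting only when the same PIDs show up across several consecutive samples.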
Yeah, unfortunately, NPD doesn’t handle runtime hangs very well! I’d suggest implementing a custom check using crictl or a containerd socket probe along with a watchdog that can mark the node as unhealthy or trigger a reboot if something's stuck. Also, keep an eye out for shim buildup or spikes in D-state as these are usually early warning signs.
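Here's a minimal sketch of what such a probe could look like. It assumes `crictl` is on the PATH and already configured to talk to the containerd socket, and it assumes you wire it in as an NPD custom plugin, where exit code 0 means OK and 1 means a problem was found. The function names, timeout and attempt count are placeholders, not anything NPD or containerd prescribes.

```python
import subprocess

# Assumption: crictl is installed and pointed at the containerd socket.
DEFAULT_CMD = ("crictl", "version")


def runtime_responds(cmd=DEFAULT_CMD, timeout_s=10.0) -> bool:
    """True if the command exits 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung runtime shows up here as a timeout
    except OSError:
        return False  # binary missing or not executable
    return result.returncode == 0


def plugin_exit_code(cmd=DEFAULT_CMD, timeout_s=10.0, attempts=3) -> int:
    """Return 0 if any attempt succeeds, 1 if every attempt fails."""
    for _ in range(attempts):
        if runtime_responds(cmd, timeout_s):
            return 0
    return 1


# As a standalone script you would end with:
#   import sys; sys.exit(plugin_exit_code())
```

Retrying a few times before reporting failure avoids flapping the node condition on a single slow response. The reboot or drain part of the watchdog is deliberately left out, since that depends on how your nodes are managed.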
