I'm dealing with an application that has a memory leak and can't just be killed without repercussions. We've set up alerts in Prometheus that fire when memory usage hits a certain threshold, but that means we have to delete the pods manually, a tedious task we sometimes do twice a day. I've thought about writing a monitor app or a CronJob that deletes the pod automatically when the threshold is reached, but I'm unsure how to implement that. Does anyone know of a better solution, or can you recommend tweaks to this process?
1 Answer
First off, an app that can't be killed is a red flag. The general practice is to treat your pods like cattle, not pets: if something's wrong, it's time to let it go. Have you considered a readiness probe that starts failing before the memory limit is reached? Kubernetes then stops routing new Service traffic to the pod, which gives in-flight work time to finish. You could then fail a liveness probe at a higher threshold. The kubelet restarts the container by sending SIGTERM and waiting for the termination grace period before force-killing it, so the app can shut down gracefully as long as it handles SIGTERM. This setup could help manage those memory leaks while your dev team works on a fix.
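
To make the two-threshold idea concrete, here's a rough sketch in Python of what the probe endpoints might look like. It isn't your app's actual code: the `/readyz` and `/livez` paths, port 8080, and both thresholds are made up for illustration, and it assumes a Linux container where `/proc/self/statm` is readable. You'd point the pod's `readinessProbe` and `livenessProbe` (`httpGet`) at these paths.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical thresholds; pick values below the container's memory limit.
READY_LIMIT_BYTES = 400 * 1024 * 1024  # stop taking new traffic here
LIVE_LIMIT_BYTES = 480 * 1024 * 1024   # ask for a restart here


def rss_bytes() -> int:
    """Return this process's resident set size in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def probe_status(path: str, rss: int) -> int:
    """Map a probe path and current memory use to an HTTP status code."""
    if path == "/readyz":
        return 200 if rss < READY_LIMIT_BYTES else 503
    if path == "/livez":
        return 200 if rss < LIVE_LIMIT_BYTES else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Respond with a bare status code; probes only look at the code.
        self.send_response(probe_status(self.path, rss_bytes()))
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), ProbeHandler).serve_forever()
```

The readiness threshold is deliberately lower than the liveness one, so the pod drops out of the Service endpoints before the restart is requested. In a real app these handlers would live inside the leaking process itself, since that's the memory you care about.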

That makes sense! Failing the probes first should help prevent user impact. I'll look into setting that up!