I'm trying to set up a monitor in Datadog for my Kubernetes cron job, but I'm hitting a wall. I'm using the metric `min:kubernetes_state.job.succeeded{kube_cronjob:my-cron-job}`. Initially, it works perfectly, but when a job fails, the metric doesn't update as I expect. It makes sense since this metric aggregates successes, but now I'm stuck. I'm looking for a way to track failures similarly to how I track successes. My ideal behavior would be: Day 1 – the cron job runs successfully, and the query shows 1. Day 2 – it fails, and the query shows 0. Day 3 – it recovers, and the query shows 1 again. What am I missing here?
5 Answers
Another option is to track `kubernetes_state.job.failed` and alert on that if it’s greater than zero. This would give you the 0/1 behavior you're looking for with failures.
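As a rough sketch of that alert, here is a monitor definition built as a plain Python dict. The `kube_cronjob` tag is taken from the question; the one-day window, the monitor name, and the helper `failed_job_monitor` are assumptions for illustration, and the payload shape should be checked against the Datadog monitor API docs before use.

```python
import json


def failed_job_monitor(cronjob: str, window: str = "last_1d") -> dict:
    """Build a monitor definition that triggers when any run of the CronJob failed.

    Hypothetical helper: the window and field values are illustrative assumptions.
    """
    # Monitor query shape: time_aggr(window):space_aggr:metric{scope} > threshold
    query = (
        f"max({window}):max:kubernetes_state.job.failed"
        f"{{kube_cronjob:{cronjob}}} > 0"
    )
    return {
        "name": f"CronJob {cronjob} has a failed run",
        "type": "query alert",
        "query": query,
        "message": f"A run of {cronjob} failed within the evaluation window.",
        "options": {"thresholds": {"critical": 0}},
    }


if __name__ == "__main__":
    # Print the definition as JSON so it can be reviewed or sent to the API.
    print(json.dumps(failed_job_monitor("my-cron-job"), indent=2))
```

Using `max` for both the time and space aggregation means a single failed Job anywhere in the window is enough to trip the alert.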
Congrats! You've hit on the classic success bias in monitoring. It's like a smoke detector that only warns about burned toast but not a blazing fire. You need to directly monitor failures to avoid that issue.
One clean approach is to limit how long finished Jobs stick around. Set `ttlSecondsAfterFinished` on the CronJob's job template so Kubernetes deletes each finished Job once the TTL expires; with the old Job objects gone, Datadog only sees the most recent runs. This way, you won't mix historical successes with today's failures.
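For reference, `ttlSecondsAfterFinished` belongs in the Job template's spec and makes Kubernetes delete a finished Job after that many seconds. Below is a minimal sketch that builds such a CronJob manifest as JSON (which `kubectl apply -f` accepts). The schedule, image, TTL value, and the helper `cronjob_manifest` are placeholders, and the two history limits are an extra assumption of mine, not something the answer above asked for.

```python
import json


def cronjob_manifest(name: str, schedule: str, image: str, ttl_seconds: int = 3600) -> dict:
    """Return a CronJob manifest whose finished Jobs are cleaned up after a TTL.

    Hypothetical helper: all argument values used below are placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Optional extra: keep at most one finished Job of each kind.
            "successfulJobsHistoryLimit": 1,
            "failedJobsHistoryLimit": 1,
            "jobTemplate": {
                "spec": {
                    # Delete each finished Job this many seconds after it ends.
                    "ttlSecondsAfterFinished": ttl_seconds,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": name, "image": image}],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = cronjob_manifest("my-cron-job", "0 3 * * *", "example/image:latest")
    print(json.dumps(manifest, indent=2))
```

Pick a TTL shorter than the interval between runs if you want only the latest Job to be visible at any time, but long enough for the failure to be scraped and alerted on before the object disappears.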
You're right that `kubernetes_state.job.succeeded` only shows success counts, so failures won't be reflected there. You might need to switch things up and set alerts for instances where your succeeded count is less than 1. That way, you can catch every failure.
It seems like you're running into a limitation of how Datadog aggregates metrics. In `min:kubernetes_state.job.succeeded`, the `min` takes the lowest value across every Job series that matches the scope, so the result depends on which Job objects still exist in the cluster rather than on the latest run alone. Instead, consider monitoring `kubernetes_state.job.failed` directly, or use a formula like `succeeded / (succeeded + failed)` for a clearer success/failure signal.
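To make the arithmetic behind that formula concrete, here is a small sketch in plain Python. It is not Datadog query syntax; `success_ratio` is a hypothetical helper, and returning `None` when no Job has finished is my own choice for handling the zero denominator.

```python
from typing import Optional


def success_ratio(succeeded: int, failed: int) -> Optional[float]:
    """Return succeeded / (succeeded + failed), or None when nothing has finished."""
    total = succeeded + failed
    if total == 0:
        # No finished Jobs in the window: the ratio is undefined, not 0 or 1.
        return None
    return succeeded / total


if __name__ == "__main__":
    print(success_ratio(1, 0))  # Day 1: one success, no failures
    print(success_ratio(0, 1))  # Day 2: one failure, no successes
    print(success_ratio(0, 0))  # No data for the window
```

With a single run per evaluation window, the ratio is 1 for a success and 0 for a failure, which matches the Day 1 / Day 2 / Day 3 behavior described in the question. The no-data case is worth handling explicitly in the monitor as well, since a window with no finished Jobs is neither a success nor a failure.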
