How to Set Up a Datadog Monitor for Failing Kubernetes Cron Jobs?

November 30, 2025

Asked By TechieTurtle42 On November 30, 2025

I'm trying to set up a monitor in Datadog for my Kubernetes cron job, but I'm hitting a wall. I'm using the metric `min:kubernetes_state.job.succeeded{kube_cronjob:my-cron-job}`. Initially, it works perfectly, but when a job fails, the metric doesn't update as I expect. It makes sense since this metric aggregates successes, but now I'm stuck. I'm looking for a way to track failures similarly to how I track successes. My ideal behavior would be: Day 1 – the cron job runs successfully, and the query shows 1. Day 2 – it fails, and the query shows 0. Day 3 – it recovers, and the query shows 1 again. What am I missing here?

5 Answers

Answered By DevOps_Ninja On December 3, 2025

Another option is to track `kubernetes_state.job.failed` and alert on that if it’s greater than zero. This would give you the 0/1 behavior you're looking for with failures.
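A rough sketch of what such a monitor could look like, written as Python that only builds the definition as a plain dict and never calls the Datadog API. The `kube_cronjob` tag comes from the question. The `last_1d` window, the monitor name, and the helper `build_failed_job_monitor` are assumptions for illustration, so adjust them to your schedule.

```python
import json


def build_failed_job_monitor(cronjob: str, window: str = "last_1d") -> dict:
    """Return a monitor definition that alerts when a job run has failed.

    Hypothetical helper: the evaluation window and naming are assumptions,
    not something confirmed in this thread.
    """
    # Metric-alert query shape:
    #   time_aggr(window):space_aggr:metric{scope} operator threshold
    query = (
        f"max({window}):max:kubernetes_state.job.failed"
        f"{{kube_cronjob:{cronjob}}} > 0"
    )
    return {
        "name": f"CronJob {cronjob} has failed runs",
        "type": "metric alert",
        "query": query,
        "message": f"At least one run of {cronjob} failed in {window}.",
        "options": {"thresholds": {"critical": 0}},
    }


if __name__ == "__main__":
    # Print the definition so it can be reviewed before creating the monitor.
    print(json.dumps(build_failed_job_monitor("my-cron-job"), indent=2))
```

The resulting `query` string can also be pasted into the monitor editor by hand if you'd rather not script it.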

Answered By MonitorMaven On December 3, 2025

Congrats! You've hit on the classic success bias in monitoring. It's like a smoke detector that goes off for burned toast but stays silent during a real fire. You need to monitor failures directly to avoid that issue.

Answered By KubeMaster45 On December 1, 2025

One clean approach is to set a TTL on the Jobs your CronJob creates. Setting `ttlSecondsAfterFinished` in the job template makes Kubernetes delete each finished Job once that many seconds have passed, so old Job objects stop lingering and Datadog reports on the most recent run. This way, you won’t mix historical successes with today's failures.
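To show where that field lives, here is a minimal sketch that builds a `batch/v1` CronJob manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The schedule, the `busybox:1.36` image, the 20-hour TTL, and the helper name `cronjob_manifest` are all placeholders I picked for the example, not values from this thread.

```python
import json


def cronjob_manifest(name: str, schedule: str, image: str, ttl_seconds: int) -> dict:
    """Build a batch/v1 CronJob manifest with a TTL on finished Jobs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    # Finished Jobs (succeeded or failed) become eligible
                    # for deletion this many seconds after they finish.
                    "ttlSecondsAfterFinished": ttl_seconds,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": name, "image": image}],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # Assumed values: daily run at 03:00, Jobs cleaned up after 20 hours.
    manifest = cronjob_manifest("my-cron-job", "0 3 * * *", "busybox:1.36", 72000)
    print(json.dumps(manifest, indent=2))
```

Pick a TTL shorter than the interval between runs if the goal is to have the previous run's Job gone before the next one starts.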

Answered By K8sWhisperer99 On November 30, 2025

You're right that `kubernetes_state.job.succeeded` only shows success counts, so failures won't be reflected there. You might need to switch things up and set alerts for instances where your succeeded count is less than 1. That way, you can catch every failure.

Answered By DataGuru123 On November 30, 2025

It seems like you're running into a limitation of how Datadog aggregates metrics. The metric you're using, `min:kubernetes_state.job.succeeded`, reflects the lowest observed success value over the time period you query, not real-time failures. Instead, consider monitoring `kubernetes_state.job.failed` directly, or using a formula like `succeeded / (succeeded + failed)`, where `succeeded` and `failed` stand for the two job metrics, for a clearer success/failure signal.
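To make the arithmetic of that formula concrete, here is a small sketch of the ratio in Python. The function name `success_ratio` and the choice to return `None` when nothing has run are my own assumptions. In Datadog itself you would express this as a formula over two queries rather than in code.

```python
from typing import Optional


def success_ratio(succeeded: float, failed: float) -> Optional[float]:
    """Return succeeded / (succeeded + failed), or None if there were no runs."""
    total = succeeded + failed
    if total == 0:
        # No finished runs in the window: the ratio is undefined, so avoid
        # dividing by zero and let the caller decide how to treat "no data".
        return None
    return succeeded / total


if __name__ == "__main__":
    # The three days from the question: success, failure, recovery.
    for day, (ok, bad) in enumerate([(1, 0), (0, 1), (1, 0)], start=1):
        print(f"Day {day}: {success_ratio(ok, bad)}")
```

With one run per day this gives 1.0, 0.0, 1.0 for the three days in the question. The "no runs" case matters too, since a job that never started produces neither a success nor a failure.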
