During our incident retrospectives, there's a recurring pattern: everyone agrees we should have alerts for certain issues, and tickets get created for them. However, those tickets often sit untouched for weeks because no one wants to write the PromQL. By the time someone finally gets to it, another incident has usually occurred, and the cycle continues. I've experimented with a tool that automates the creation of Prometheus alert configurations from incident notes, but I'm unsure whether it's worth developing further or whether this is even a common problem. How do others manage this workflow? Is there a better way to keep alert tickets from going stale?
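To make the question concrete, here is a minimal sketch of the kind of output such a tool would need to produce. It is not the actual tool, just an illustration under stated assumptions: the `IncidentNote` fields, the `render_rule_group` helper, and the `http_requests_total` metric in the example are all hypothetical, and it assumes someone has already distilled the incident notes into a PromQL expression. Only the output layout (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`) follows the standard Prometheus rule file format.

```python
import json
from dataclasses import dataclass


@dataclass
class IncidentNote:
    """Hypothetical structured summary of one retro action item."""
    alert_name: str    # e.g. "HighErrorRate"
    expr: str          # PromQL condition that should have fired
    for_duration: str  # how long the condition must hold, e.g. "10m"
    severity: str      # routing label, e.g. "page" or "ticket"
    summary: str       # human-readable text for the annotation


def render_rule_group(group_name: str, notes: list[IncidentNote]) -> str:
    """Render a Prometheus rule file (YAML) containing one rule group."""
    lines = ["groups:", f"  - name: {group_name}", "    rules:"]
    for note in notes:
        lines += [
            f"      - alert: {note.alert_name}",
            # json.dumps yields a double-quoted string, which is also a
            # valid YAML scalar, so quotes inside the PromQL stay escaped.
            f"        expr: {json.dumps(note.expr)}",
            f"        for: {note.for_duration}",
            "        labels:",
            f"          severity: {note.severity}",
            "        annotations:",
            f"          summary: {json.dumps(note.summary)}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example action item; the metric name and threshold are assumptions.
    note = IncidentNote(
        alert_name="HighErrorRate",
        expr=(
            'sum(rate(http_requests_total{status=~"5.."}[5m]))'
            " / sum(rate(http_requests_total[5m])) > 0.05"
        ),
        for_duration="10m",
        severity="page",
        summary="5xx ratio above 5% for 10 minutes",
    )
    print(render_rule_group("postmortem-alerts", [note]))
```

The templating is the easy part. Getting from free-form incident notes to a correct `expr` is where the real work is, and that is the part I'm unsure a tool can solve.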
7 Answers
We discuss alert conditions during our incident response meetings and assign them immediately with deadlines. It really helps keep the momentum going. Often someone has already resolved it before we even wrap up the meeting.
As others mentioned, assigning it to someone is key. Make sure there’s accountability.
It sounds like the main issue is accountability. Maybe create a task in your project management tool and assign it to someone right away to ensure it gets done. Waiting around often leads to these tickets being forgotten.
I've seen those tickets sit idle for ages myself. The priority needs to be there to prevent that.
I honestly haven't faced this problem much. In my team, postmortem tickets take precedence, and implementing or updating alerts typically takes about 20 minutes. They rarely sit in the backlog for long, and they’re just an easy win for us.
Honestly, I'd just write the PromQL myself while the incident is fresh. If friction is what's stopping you, that’s the problem to address, not the complexity of the task itself.
It sounds like there might be a bigger issue with team dynamics. Also consider that AI can generate PromQL for you now, which makes this a quick task. If no one wants to take charge of this, there could be serious underlying issues with your DevOps culture.
How long does it actually take to write that PromQL? I think you're looking for a solution to a cultural issue more than anything else. If the pressure is on to ship features, ops tasks like this often get neglected.
Absolutely, culture can be the core issue. Developers at my place handle ops too, but when there's pressure, those tasks take a back seat.

Totally! This does seem like more of an accountability issue than a technical one. In my experience, we would have those alerts set up within a day without waiting for sprint planning. It just gets done.