During incidents, I find I spend more time updating tools than actually fixing the problem. In a recent incident, I created a PagerDuty incident, a Jira ticket, a Slack channel, a status page update, and a Confluence page for the postmortem. While juggling all of that, I missed a status page update, and my CEO messaged me about customer complaints before I had a chance to address them. By the time I was done with the updates, the incident was largely resolved, and then I had to spend extra time reconciling timestamps across all these platforms. This seems inefficient! Is there a more effective way to manage all these tools during an outage? Are others facing this too, or is this just part of the job?
5 Answers
You're not alone in this. A system that records each update once and pushes it out to every channel could help a lot. Imagine having a tool that acts like a smart assistant during incidents!
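To make the idea concrete, here is a minimal sketch of a single timeline that stamps each update once (in UTC) and forwards it to every tool, so timestamps never need reconciling afterwards. The names `Update`, `IncidentTimeline`, and the two list-backed sinks are hypothetical stand-ins; real sinks would call each tool's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass(frozen=True)
class Update:
    message: str
    at: datetime  # stamped once, in UTC, when the update is recorded


@dataclass
class IncidentTimeline:
    # Each sink is a hypothetical callable that forwards an update to one tool.
    sinks: List[Callable[[Update], None]] = field(default_factory=list)
    updates: List[Update] = field(default_factory=list)

    def post(self, message: str) -> Update:
        # One timestamp per update, shared by every destination.
        update = Update(message=message, at=datetime.now(timezone.utc))
        self.updates.append(update)
        for sink in self.sinks:
            sink(update)
        return update


# Stand-in sinks that only collect text; real ones would call each tool's API.
status_page: List[str] = []
ticket_comments: List[str] = []

timeline = IncidentTimeline(sinks=[
    lambda u: status_page.append(f"{u.at.isoformat()} {u.message}"),
    lambda u: ticket_comments.append(f"[{u.at.isoformat()}] {u.message}"),
])
timeline.post("Investigating elevated error rates")
```

Because every destination renders the same `Update`, the postmortem can be built from `timeline.updates` instead of being pieced together from five tools.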
You might want to have an incident manager on your team to handle updates like the status page while engineers focus on fixing the issue. That way, you’re not juggling updates and can dedicate your efforts to resolving the incident instead.
Consider building some automation tools to smooth out the process. For instance, a Slack bot that can start a PagerDuty incident, create a Slack channel, and notify the necessary team members could save you loads of time. Services like Rootly handle a lot of this well!
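As a rough sketch of what such a bot might do behind the scenes, the function below runs each setup step in turn and records failures instead of stopping, so one broken integration doesn't block the rest. The step functions (`page_on_call`, `create_channel`, `notify_team`) are hypothetical placeholders, not real PagerDuty or Slack calls.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]


def open_incident(title: str, steps: List[Step]) -> Dict[str, str]:
    """Run every setup step and report per-step results.

    A failing step is recorded and skipped so one broken integration
    does not block the others.
    """
    results: Dict[str, str] = {}
    for name, action in steps:
        try:
            results[name] = action(title)
        except Exception as exc:
            results[name] = f"FAILED: {exc}"
    return results


# Hypothetical stand-ins; real versions would call PagerDuty and Slack.
def page_on_call(title: str) -> str:
    return f"paged on-call for '{title}'"


def create_channel(title: str) -> str:
    return "#inc-" + title.lower().replace(" ", "-")


def notify_team(title: str) -> str:
    # Simulates an integration that is down during the incident.
    raise RuntimeError("notification service unreachable")


results = open_incident("Checkout errors", [
    ("pagerduty", page_on_call),
    ("slack_channel", create_channel),
    ("notify", notify_team),
])
```

Posting `results` back into the incident channel would show at a glance which steps still need to be done by hand.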
Totally agree. I've used Rootly and it really simplifies the incident management process. It takes care of a lot of the busy work!
Not interested in shopping for tools right now.
We have a cool internal tool that automates the entire process. You just type a command in Slack, and everything is set up in no time. It's a great example of how effective chatops can be in incident response.
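I can't share the tool itself, but the entry point of a setup like this can be very small. Here is a sketch of parsing the text typed after a command, assuming a made-up `/incident <severity> <title>` syntax; the severity names and the `IncidentCommand` type are illustrative only.

```python
from dataclasses import dataclass

SEVERITIES = {"sev1", "sev2", "sev3"}


@dataclass(frozen=True)
class IncidentCommand:
    severity: str
    title: str


def parse_command(text: str) -> IncidentCommand:
    """Parse '<severity> <title...>' as typed after a hypothetical /incident command."""
    parts = text.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError("usage: /incident <sev1|sev2|sev3> <title>")
    severity, title = parts[0].lower(), parts[1].strip()
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {parts[0]}")
    return IncidentCommand(severity=severity, title=title)


cmd = parse_command("SEV2 Checkout errors in EU")
```

Rejecting malformed input up front matters here, since nobody wants a half-created incident because of a typo at 3 a.m.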
Automation is key here! In our setup, the incident manager handles the Jira ticket, the status page, and postmortem creation, so the rest of the engineers can stay focused on resolving the problems.

Exactly! Coordination should be the incident manager's role. We've found this approach works so much better for everyone involved.