I'm currently a PhD student focusing on program repair and debugging, and I want to contribute something valuable to SREs and DevOps engineers. I'm looking into how teams manage incidents. I'd appreciate insights from anyone who's on-call or involved in incident management. Here are a few specific points I'd love your thoughts on:
1. What do you find is the most challenging part of dealing with an incident? Is it differentiating between actual causes and background noise? Identifying recent changes? Mapping symptoms to the right services? Or maybe switching between different tools like Datadog, Jira, and Slack?
2. Apart from simply rolling back, what steps do you take when an incident occurs? Which tools do you prioritize first? What's your typical process from alert to finding the solution?
3. How do you handle searching through information during an incident? Do you have a preferred stack, like ELK?
4. Have you experimented with AI-driven SRE tools like Datadog Watchdog or Dynatrace Davis? Did they provide tangible benefits during real incidents? If not, what do you think is missing?
5. If you could magically solve one problem during incidents, what would it be? For instance, automatically highlighting the most likely problematic changes, or pulling together historical incident information?
I'm open to long responses and personal stories—your feedback can really shape my research, so thank you in advance!
5 Answers
In my experience, what really helps with debugging is having someone on the team who's been around for a while and knows the system inside out. There's no substitute for experience when it comes to this kind of work. And about AI tools? They can provide insights, but I've found they often miss the mark. You really need that human intuition and context which machines just can’t replicate. Every incident feels different, and without that deeper understanding, it gets tricky!
1. The first step is always to assess how bad the impact is; without enough information, it's hard to find the root cause.
2. Once we have data, we weigh our options for addressing the issue, often pursuing several in parallel, with the goal of minimizing impact.
3. After handling the incident, we run a blameless post-incident review so we actually learn from what went wrong. It's all about finding a way to improve for next time!
The hardest part? Getting accurate info from users. Sometimes they refer to issues using terms we're not familiar with, which can lead us to the wrong place right off the bat. And then you have logs that don't provide the details we need—super frustrating! We often rely on AWS CloudWatch, but if something crucial isn’t logged, we’re out of luck. Making our logging more intelligent and results-focused has been a big priority for us.
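To give a concrete picture of what "results-focused" logging can look like, here's a minimal sketch assuming a Python service that writes one JSON object per log line, which CloudWatch Logs Insights can then filter on by field. The service, field, and ID names are just placeholders, not anything from our actual setup:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so log search tools
    (e.g. CloudWatch Logs Insights) can filter on individual fields."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout-api",                      # placeholder service name
            "request_id": getattr(record, "request_id", None),
            "order_id": getattr(record, "order_id", None),  # the terms users actually report
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log the business context a user would mention, not just an internal stack trace.
logger.info("payment declined by upstream",
            extra={"request_id": "req-1234", "order_id": "A-9876"})
```

The point is to capture the context users describe the problem in (order IDs, customer-facing errors), so their report can be matched to log entries instead of sending us to the wrong place.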
I think the key lies in effective instrumentation and logging. If you have enough visibility into what's going on, diagnosing problems becomes a lot easier. Our usual steps are making sure the system is observable, keeping deployments easy to roll back, testing fixes in non-production, and then getting the fix out fast. It's a cycle of constant improvement!
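As a rough illustration of that kind of instrumentation, here's a minimal sketch using prometheus_client as a stand-in; the metric names are hypothetical, and the same idea carries over to Datadog or any other metrics backend:

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; use whatever vocabulary fits your service.
REQUESTS = Counter("orders_requests_total", "Order requests handled", ["outcome"])
LATENCY = Histogram("orders_request_seconds", "Time spent handling an order request")

@LATENCY.time()
def handle_order():
    # Stand-in for the real request handler.
    time.sleep(random.uniform(0.01, 0.05))
    outcome = "ok" if random.random() > 0.1 else "error"
    REQUESTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus (or any scraper) to pull
    while True:
        handle_order()
```

With a request counter split by outcome and a latency histogram in place, "how bad is it and when did it start" becomes a dashboard query rather than guesswork.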
To start, I always ask, what's different? That’s usually the key question during an incident. After that, things can get a bit chaotic. It’s about getting to the root cause, which often takes some digging, especially if symptoms aren't clear. Using tools like Datadog for metrics is a must, but I often find myself jumping back and forth between dashboards and logs to understand the full picture. I wish there was a way to trace a single request across all services in a more streamlined manner, gathering relevant logs and metrics in one go—would save a lot of time!
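Short of full distributed tracing, a cheap approximation with the same flavor is propagating a correlation ID and stamping it on every log line, so logs from different services can at least be joined on one field. A minimal sketch below; the header name, logger setup, and service names are assumptions for illustration, not any particular tool's API:

```python
import contextvars
import logging
import uuid

# One ID shared by every log line in a request, so logs from different services
# can be joined on the same field during an incident.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(
    format="%(asctime)s %(request_id)s %(name)s %(message)s",
    level=logging.INFO,
)
for h in logging.getLogger().handlers:
    h.addFilter(RequestIdFilter())

log = logging.getLogger("orders")

def call_downstream():
    # In a real service you'd forward request_id_var.get() as a header
    # (e.g. X-Request-ID) so the next service logs the same value.
    log.info("calling payments service")

def handle_request(incoming_id=None):
    # Reuse the caller's ID if one was sent; otherwise mint a fresh one.
    request_id_var.set(incoming_id or uuid.uuid4().hex)
    log.info("looking up order")
    call_downstream()

handle_request()
```

It doesn't replace dashboards or proper tracing, but grepping one ID across services is a big step up from eyeballing timestamps.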
