We have an established incident response plan that includes on-call rotations, alert systems, and postmortems. Recently, customers have started asking how we validate our incident response practices, and I've realized we've never treated this as something that requires concrete evidence. While we handle incidents and have access to logs and historical data, I'm looking for ways to gather this information more efficiently, ideally on a daily basis, so it can be presented easily. Beyond just showing screenshots, what other evidence can I compile? Is it true that having more evidence is better in this context? Any thoughts would be greatly appreciated!
4 Answers
Show metrics like Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR). These figures demonstrate how quickly you notice and resolve incidents, and by extension how incidents impact your customers. Track the cost of downtime for key services or APIs and categorize incidents by severity. Remember, it's about providing context, and it helps to get buy-in from other teams on definitions and responsibilities around incidents.
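If your incident history can be exported as structured records, these metrics are cheap to compute on a schedule. Here's a minimal sketch, assuming each record carries ISO-8601 "started_at", "detected_at", and "resolved_at" timestamps plus a "severity" field (adjust the field names to whatever your tracker actually exports):

```python
# Minimal sketch of computing MTTD/MTTR from exported incident records.
# Field names and sample data are illustrative, not a prescribed schema.
from datetime import datetime
from statistics import mean

incidents = [
    {"severity": "sev2", "started_at": "2024-03-01T10:00:00",
     "detected_at": "2024-03-01T10:04:00", "resolved_at": "2024-03-01T11:10:00"},
    {"severity": "sev1", "started_at": "2024-03-07T02:30:00",
     "detected_at": "2024-03-07T02:33:00", "resolved_at": "2024-03-07T03:05:00"},
]

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mttd = mean(minutes_between(i["started_at"], i["detected_at"]) for i in incidents)
mttr = mean(minutes_between(i["started_at"], i["resolved_at"]) for i in incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min over {len(incidents)} incidents")
```

Running this daily against your full incident export gives you a trend line rather than a one-off snapshot, which is usually what reviewers want to see.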
Consider running tabletop exercises regularly, where team members can simulate incident responses. It helps everyone understand their roles and can highlight any gaps in your incident management. Having a more hands-on approach with these drills can make a huge difference.
You could also hold 'game days' where a few engineers create scenarios around known weak points to test your incident response. It's not only effective for checking your alerts and logs but also helps sharpen your incident response strategies. Documenting these exercises could provide solid evidence for your practices.
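For both tabletop exercises and game days, the evidence comes from writing down what you ran and what you found while it's fresh. A minimal sketch of appending each exercise to a structured log (the file name and fields here are hypothetical; capture whatever your reviewers will want to see):

```python
# Minimal sketch of recording tabletop/game-day outcomes as structured evidence.
import json
from datetime import date, datetime

def record_exercise(scenario: str, participants: list[str],
                    gaps_found: list[str], follow_ups: list[str],
                    path: str = "game_day_log.jsonl") -> None:
    """Append one exercise record to a JSON Lines evidence file."""
    entry = {
        "recorded_at": datetime.now().isoformat(timespec="seconds"),
        "exercise_date": date.today().isoformat(),
        "scenario": scenario,
        "participants": participants,
        "gaps_found": gaps_found,
        "follow_ups": follow_ups,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_exercise(
    scenario="Primary database failover",
    participants=["alice", "bob"],
    gaps_found=["Replica promotion runbook step 4 was out of date"],
    follow_ups=["Update runbook", "Add alert on replication lag > 60s"],
)
```

A dated log like this, plus the follow-up tickets it points to, is exactly the kind of artifact a customer review can consume without screenshots.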
This is a common concern as clients dig deeper into security processes. Most of them aren't looking for perfection; they want to see that your approach is systematic and replicable. Having a documented runbook along with a few real-life examples can convey that much more effectively than a theoretical plan that hasn’t been tested out yet.
Absolutely! It's better to maintain ongoing records than to scramble to compile evidence after an incident. We started documenting in real time to centralize our information, and we set up a system up front to track everything we would need to show, which took the guesswork out of reviews.
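The simplest version of "documenting in real time" is a timestamped timeline that the on-call engineer appends to as the incident unfolds. A minimal sketch, with illustrative names and file layout rather than a prescribed format:

```python
# Minimal sketch of capturing an incident timeline as it happens, so the
# record already exists when a customer or auditor asks for it.
import json
from datetime import datetime, timezone

class IncidentTimeline:
    def __init__(self, incident_id: str):
        self.path = f"incident-{incident_id}.jsonl"

    def note(self, event: str, detail: str = "") -> None:
        """Append one timestamped timeline entry for this incident."""
        entry = {
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "event": event,
            "detail": detail,
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

timeline = IncidentTimeline("2024-118")
timeline.note("paged", "High error rate on checkout API")
timeline.note("acknowledged", "On-call engineer joined the bridge")
timeline.note("mitigated", "Rolled back deploy 4521")
```

In practice many teams get the same effect from a dedicated incident channel with timestamps, but either way the point is that the timeline is written during the incident, not reconstructed afterwards.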

Yeah, and if possible, try some chaos testing in your infrastructure. It's a great way to exercise your system's resilience, and when you run a root cause analysis after an issue and that class of failure doesn't recur, that's strong evidence your processes are working.
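A chaos experiment can be as small as "terminate one instance and check how long detection takes." Here's a minimal sketch assuming kubectl access to a non-production Kubernetes cluster and a hypothetical "checkout" namespace; the alert check at the end is a placeholder for your own monitoring query:

```python
# Minimal sketch of a "kill one pod and verify detection" chaos experiment.
# Point this only at a non-production environment, or add guardrails first.
import random
import subprocess
import time

NAMESPACE = "checkout"  # hypothetical namespace

def random_pod(namespace: str) -> str:
    """Pick a random pod name from the namespace via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    return random.choice(out.stdout.split())

def kill_pod(pod: str, namespace: str) -> None:
    """Delete the chosen pod to simulate an instance failure."""
    subprocess.run(["kubectl", "delete", "-n", namespace, pod], check=True)

if __name__ == "__main__":
    victim = random_pod(NAMESPACE)
    print(f"Terminating {victim}")
    kill_pod(victim, NAMESPACE)
    time.sleep(120)  # give monitoring time to react
    # Here you would query your alerting system and record whether the failure
    # was detected within your target MTTD -- that record is the evidence.
```

Recording the time injected versus the time alerted, per experiment, ties the chaos testing back to the MTTD numbers mentioned earlier in the thread.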