Our team recently completed our annual audit of internal monitoring tools, and I wanted to share some of what we do. We audit alerts across platforms like CloudWatch, Splunk, Chronosphere, Grafana, and custom cron jobs to determine whether they're still necessary and accurate. We also review AWS Auto Scaling Groups (ASGs) to ensure they have the right resources and are still owned by our team. This is just a small part of the audit, which often involves pulling data from different systems to assess the current state of our infrastructure and tools. We compile everything into a spreadsheet and assign the resulting tasks to team members. I'm interested in knowing:
- How often are you auditing your infrastructure and tools?
- Do you have any advanced tools for this process beyond just spreadsheets?
- What is the typical time frame for your audits?
I'd love to hear what strategies work well for others!
5 Answers
We're still in the early stages, but I’m developing a context layer that maps dependencies between our tools. We're collaborating with larger teams to clarify service ownership, aiming to uncover any unknown gaps in our setup.
We recently started using Drata for compliance management, and it has been helpful. For alerts, we add new ones when incidents occur that our existing setup didn’t catch. It can be a problem if alerts go unanswered, but we tackle that separately!
We streamlined our monitoring by consolidating multiple tools. We transitioned from Icinga, Munin, and Graphite to Prometheus, which makes it much simpler to pull in data from CloudWatch and report on it from a single system.
We rely on a detailed Excel sheet that outlines all systems in play alongside non-functional requirements (NFRs). We start with one reliable system as a benchmark and identify gaps based on empty cells—it's a straightforward way to pinpoint issues.
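The empty-cell check described above can also be scripted. Here is a minimal sketch, assuming the sheet is exported to CSV with one row per system and one column per NFR. The column names, system names, and the `find_gaps` helper are all hypothetical, made up for illustration, and not taken from the actual sheet.

```python
import csv
import io

# Hypothetical inventory exported from the spreadsheet:
# one row per system, one column per non-functional requirement.
SHEET = """system,alerting,backups,runbook,owner
billing-api,yes,daily,wiki/billing,team-payments
search,yes,,wiki/search,
batch-reports,,weekly,,team-data
"""


def find_gaps(rows, benchmark):
    """Return {system: [missing columns]} relative to the benchmark system.

    Only columns that the benchmark row actually fills in are treated
    as required, mirroring the "one reliable system" approach.
    """
    by_name = {row["system"]: row for row in rows}
    reference = by_name[benchmark]
    required = [
        col for col, val in reference.items()
        if col != "system" and val.strip()
    ]
    gaps = {}
    for name, row in by_name.items():
        missing = [col for col in required if not (row.get(col) or "").strip()]
        if missing:
            gaps[name] = missing
    return gaps


rows = list(csv.DictReader(io.StringIO(SHEET)))
gaps = find_gaps(rows, benchmark="billing-api")
for system, missing in gaps.items():
    print(f"{system}: missing {', '.join(missing)}")
```

Treating only the benchmark's filled-in columns as required keeps the check from flagging NFRs that nobody has defined yet. Whether that is the right rule depends on how complete the benchmark row really is.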
In the world of DevOps, there are always unknowns that an audit might not reveal. For example, how can you tell if something wasn’t logged at all? Sometimes it feels like a chaos monkey is the only real solution to manage outages effectively.
