I'm curious about how small to medium-sized businesses handle their monitoring and alerting with tools like Grafana, Datadog, and Groundcover. Specifically, I'd like to know how they decide which metrics to monitor, what should trigger alerts, and what threshold levels to set for those alerts. What does the process look like in your organization? Is it mostly reactive, learning from incidents and figuring out which metrics or alerts were missing? Or do you also take a proactive approach, regularly reviewing your infrastructure and services to create relevant alerts? Any insights on streamlining this process would be really helpful!
1 Answer
We use GitOps to manage our Alertmanager and Grafana configurations, so adding a new alert or dashboard is just a pull request. We focus on monitoring metrics that have actually caused issues in the past, which keeps us alerting only on what really matters. That knowledge builds up over time and helps us avoid unnecessary alerts.
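To make that concrete, here's a rough sketch of what one of those pull requests might contain, assuming the alert conditions are defined as Prometheus alerting rules checked into the same repo (in a typical Prometheus + Alertmanager setup, the rules define when an alert fires and Alertmanager handles routing and notification). The file path, service name, metric names, and thresholds below are all placeholders, not anything from the answer above:

```yaml
# prometheus/rules/checkout-latency.yaml  (hypothetical file added in a PR)
groups:
  - name: checkout-latency
    rules:
      - alert: HighRequestLatency
        # Fires when average request latency on the checkout service
        # stays above 500ms for 10 minutes. Threshold and duration are
        # illustrative; in practice they'd come from past incidents.
        expr: |
          rate(http_request_duration_seconds_sum{service="checkout"}[5m])
            / rate(http_request_duration_seconds_count{service="checkout"}[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Checkout request latency above 500ms for 10 minutes"
```

Reviewing a change like this in a pull request is what makes the GitOps workflow nice: the threshold, duration, and routing labels all get discussed before the alert ever fires.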

But what's the source of those alerts? Are they defined mainly in the Alertmanager configuration?