I'm curious about how often you all discover problems in your software once it goes live. What strategies do you have in place for identifying these issues, and how do you get alerted when something goes wrong?
2 Answers
We rely on Prometheus along with Alertmanager for alerts. Whenever there's an issue, Alertmanager notifies our team through Slack, which helps us respond quickly.
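For anyone who wants to see roughly what that flow looks like in code, here is a minimal sketch, not our actual setup. It assumes Alertmanager is configured with a generic webhook receiver pointing at a small relay instead of its built-in Slack integration, and that the relay forwards to a Slack incoming webhook. The function names `format_slack_message` and `post_to_slack`, the sample alert, and the webhook URL are all hypothetical.

```python
import json
import urllib.request


def format_slack_message(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a Slack incoming-webhook body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "no summary provided")
        lines.append(f"[{status}] {name} (severity={severity}): {summary}")
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": "\n".join(lines) or "Alertmanager sent an empty notification"}


def post_to_slack(webhook_url: str, message: dict) -> int:
    """POST the message to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical payload shaped like an Alertmanager webhook notification.
    sample = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "HighErrorRate", "severity": "page"},
                "annotations": {"summary": "5xx rate above 5% for 10 minutes"},
            }
        ],
    }
    print(format_slack_message(sample)["text"])
    # To actually send it, pass your own webhook URL (placeholder, not a real one):
    # post_to_slack("https://hooks.slack.com/services/...", format_slack_message(sample))
```

In practice Alertmanager's native Slack receiver covers this without any custom code; a relay like this is only worth it if you want to reshape the message yourself.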
I use Datadog's Watchdog alerts for each service we run. They automatically notify the engineering team that owns the affected service, so that team can debug, roll back, or ship a fix through CI as needed.
If you're looking to manage costs, have you considered open-source options like OpenTelemetry (OTel) or SigNoz?

Does the alert provide a stack trace, or is it just sending an error log?