I'm really curious about how smaller teams, or those without a big budget for enterprise tools, manage the trade-offs involved in monitoring and observability. For instance, platforms like Datadog, New Relic, or CloudWatch can start costing a lot once you begin tracking a wider range of metrics. It gets tricky because trimming down metrics feels somewhat risky.

So for those of you running lean infrastructure setups, I have a few questions:

- Do you actively drop or sample metrics, logs, or traces to save on costs?
- Have you come across any budget-friendly tools or stacks, like Prometheus combined with Grafana or Loki/Tempo, that still provide adequate visibility?
- What criteria do you use to determine what's worth monitoring versus what could be considered a 'nice to have'?

I'm just keen to learn how different teams strike a balance between observability depth and costs in real scenarios.
1 Answer
In my experience, metrics aren't as costly as many people think. Traces are where we sample: once you have strong metrics, tracing is mainly useful for analyzing errors, so we use tail-based sampling, which decides whether to keep a trace only after it has completed and can therefore retain the ones that contain errors. In the past, we paid a fortune for tracing that was really just capturing what metrics could already handle, like average and percentile response times. For logs, it depends on the use case, but we usually log only warnings and above in production to manage costs, especially since audit logs are already pricey on their own. The key is to keep checking what you actually need.
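
To make the tail-based sampling idea concrete, here is a minimal sketch in Python. It is not any vendor's or collector's actual implementation; the `Span` type, the `keep_trace` function, and the threshold values are all hypothetical names and numbers chosen for illustration. The point it shows is that the decision is made once the whole trace is available, so error traces are always kept and healthy ones are sampled down.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Span:
    """Hypothetical, simplified span record (illustration only)."""
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: Sequence[Span],
    slow_ms: float = 1000.0,        # assumed latency threshold
    baseline_rate: float = 0.01,    # assumed keep rate for healthy traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a trace after all of its spans have arrived."""
    # Always keep traces that contain at least one error span.
    if any(span.is_error for span in spans):
        return True
    # Always keep unusually slow traces.
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True
    # Keep only a small random fraction of everything else.
    return rng() < baseline_rate


if __name__ == "__main__":
    failed = [Span("t1", 12.0), Span("t1", 40.0, is_error=True)]
    healthy = [Span("t2", 15.0), Span("t2", 22.0)]
    print("failed trace kept:", keep_trace(failed))
    print("healthy trace kept:", keep_trace(healthy))
```

With a rule like this, the volume you pay for scales mostly with your error and slow-request rate rather than with total traffic, which is why it pairs well with metrics that already cover averages and percentiles.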

How do you manage audit logs, though? They seem to be a tricky area where you can't really delete anything, but you also can't afford to keep everything!