For teams using OpenTelemetry in production, how do you typically identify the root cause when your observability costs increase? Is this something you manage at the collector level, the backend, or do you not have control over it at all?
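For context, by "collector level" I mean things like sampling and filtering in the Collector pipeline before data ever reaches the backend. A minimal sketch of the kind of knobs I have in mind (the endpoint and sampling percentage are just illustrative):

```yaml
# otel-collector config (excerpt) - cut volume before it reaches the backend
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  # keep roughly 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlphttp]
```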
3 Answers
If you're self-hosting, I suggest switching Prometheus to VictoriaMetrics, Loki to VictoriaLogs, and Tempo to Jaeger or VictoriaTraces. You can keep your Grafana dashboards while testing these alternatives. Even without lowering retention or metric cardinality, you should see a significant reduction in compute cost and overall complexity.
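Because VictoriaMetrics speaks the Prometheus remote-write and query APIs, trying it can be as small as a config change. A minimal sketch, assuming a single-node instance reachable at victoriametrics:8428 (hostname and port are placeholders):

```yaml
# prometheus.yml (excerpt) - ship samples to VictoriaMetrics alongside the local TSDB
remote_write:
  - url: http://victoriametrics:8428/api/v1/write

# Grafana datasource provisioning - existing dashboards keep working because
# VictoriaMetrics is queried through the standard Prometheus datasource type
---
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    access: proxy
    url: http://victoriametrics:8428
```

The same remote-write pattern works with vmagent if you later want to drop Prometheus entirely.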
We’ve got a fully self-hosted LGTM stack running on Azure Blob for storage, and honestly, we don't really worry about costs since it’s only about $2k per year. But then again, we’re not a massive operation: just 500 CPUs, 15 million metrics, and around 1TB of logs.
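For anyone wanting to replicate the object-storage part, the Loki side of a setup like that mostly comes down to pointing chunk and index storage at a Blob container. A rough sketch, where the account, key variable, and container names are placeholders and exact field names can differ between Loki versions:

```yaml
# loki.yaml (excerpt) - store chunks and the TSDB index in Azure Blob
common:
  storage:
    azure:
      account_name: mystorageaccount      # placeholder
      account_key: ${AZURE_STORAGE_KEY}   # or use_managed_identity: true
      container_name: loki-data           # placeholder

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: azure
      schema: v13
      index:
        prefix: index_
        period: 24h
```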
Wait, $2k a year for 500 vCPUs? That's pretty impressive! In AWS, you'd be hard-pressed to find a setup that cheap; most base instances start around that price, right? How do you manage the compute resources so affordably?

That's a great suggestion! We're looking for ways to optimize our stack without losing functionality. Thanks for the tip!