I'm really struggling with keeping my monitoring costs down while still maintaining visibility into the important metrics for my Kubernetes clusters. Despite trying to cut down on logs and metrics, my monitoring bill seems to keep climbing. I've used trace sampling and set shorter retention periods, but it often means missing crucial information when something goes wrong. I'm on AWS and utilizing tools like Prometheus, Grafana, Loki, and Tempo, but the highest costs come from storage and high-cardinality metrics. I've experimented with head and tail sampling, yet I still miss those rare errors that are vital. Any tips or advice on how to manage these costs while ensuring I don't lose visibility would be greatly appreciated!
10 Answers
We've had success by taking a few specific actions: First, eliminate high-cardinality labels right off the bat, like pod UID or request paths. They dramatically increase Prometheus storage. Next, keep detailed logs and traces short-lived and consider moving them to cheaper S3 storage for later use. Using exemplars to link metrics to traces is a better way to manage costs than trying to keep everything. We found that dynamic sampling based on errors and latency works much better than fixed sampling, allowing you to capture what's needed without breaking the bank. Sometimes it's just about maintaining 'good enough' visibility instead of perfect coverage.
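For the label-dropping piece, here's a minimal sketch of the kind of scrape config we use; the label names are just placeholders for whatever is actually blowing up your cardinality:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop the offending labels at ingestion so the series collapse
      # before they ever reach TSDB storage (label names are placeholders).
      - action: labeldrop
        regex: "pod_uid|request_path"
```

Keep in mind that labeldrop merges any series that differed only in those labels, so make sure the labels you keep still uniquely identify what you need for debugging.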
What trace sampling methods have you tried, and where exactly is the spend going? It's hard to pinpoint the issue without a clearer picture of your setup.
You might want to use the cardinality explorer in VictoriaMetrics. It's a useful tool for diagnosing these issues!
This is definitely a complex challenge, and in my experience it takes a lot of trial and error to fine-tune what you collect. One approach worth considering is to only ship debug-level logs when you actually need them, such as during an active incident.
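If you're shipping logs with Promtail, for example, a rough sketch of one way to do that, assuming your apps emit JSON logs with a lowercase level field; you'd remove the drop stage while actively debugging:

```yaml
pipeline_stages:
  # Extract the "level" field from JSON log lines (assumes such a field exists)
  - json:
      expressions:
        level: level
  # Drop debug-level lines so they never reach Loki; delete this stage
  # temporarily when you need full verbosity during an incident
  - drop:
      source: level
      value: debug
```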
I think your issue lies in treating the symptoms instead of fixing the root causes. High-cardinality metrics often explode because of wasteful data your applications generate in the first place. Focus on optimizing your Kubernetes resource configurations, refining your metric labels, and batching your log outputs. Tools like Pointfive can help you find those inefficiencies before they become expensive.
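On the batching point, if your logs already flow through an OpenTelemetry Collector, its batch processor is the usual place to do it. A sketch only; the sizes, timeout, and Loki endpoint are placeholders for your own setup:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  batch:
    # Accumulate up to 8192 records or 5s of data before exporting,
    # so the exporter makes fewer, larger requests (values are illustrative)
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlphttp:
    endpoint: http://loki-gateway:3100/otlp   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```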
This is indeed a complex engineering challenge! You might find prom-analytics useful. It helps to analyze your metric usage and identify unused metrics, plus it can show expensive queries and how users interact with the data you collect. This can guide you on how long to store metrics based on real user patterns. Check it out!
Have you tried tail-based sampling? It might be worth looking into if you're not already using it, since it lets you decide which traces to keep based on their relevance and impact.
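If there's an OpenTelemetry Collector sitting in front of Tempo, the tail_sampling processor is where you'd express this: always keep traces with errors or high latency, and only sample a small share of the healthy rest. A sketch with placeholder thresholds and percentages:

```yaml
processors:
  tail_sampling:
    # Wait this long for all spans of a trace to arrive before deciding
    decision_wait: 10s
    policies:
      # Always keep traces that contain an error
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep unusually slow traces (threshold is a placeholder)
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      # Keep only a small sample of everything else
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

A trace is kept if any one of the policies matches, which is what lets the rare errors survive while the bulk of healthy traffic gets sampled down.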
One thing that really helped me is moving away from the 'collect everything' mindset. Focus on tagging only the namespaces and workloads that matter for debugging, and drop any redundant labels. Instead of using S3 for storage, push short-lived metrics to local Prometheus storage. I also added Groundcover to get eBPF-based visibility, which provided a clearer picture of what's happening in the cluster without altering the application code. This allowed me to highlight and eliminate unnecessary metrics and expensive traces. Reworking my alerting and sampling strategies based on actual usage rather than just raw volume significantly cut down my storage costs while still giving me enough context to troubleshoot issues effectively.
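To make the "tag only what matters" part concrete, this is roughly the kind of scrape-time rule I mean under a kubernetes_sd job; the namespace names are obviously placeholders for your own:

```yaml
relabel_configs:
  # Keep only targets from the namespaces we actually debug against;
  # everything else is never scraped at all (namespace names are placeholders)
  - source_labels: [__meta_kubernetes_namespace]
    regex: "payments|checkout|auth"
    action: keep
```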
Are you using Prometheus? That could play a role in addressing your storage costs.
High-cardinality metrics really do wreak havoc on costs. I cut labels like pod UID and request paths from most of my Prometheus metrics, which reduced my usage by a third. For traces, dynamic sampling has been a game-changer; it focuses on saving the important data without sacrificing performance. Make sure to also watch for those dashboard panels that can cause heavy ad hoc aggregations, especially during peak traffic.
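For the dashboard part, one option is to move those heavy aggregations into recording rules so Grafana queries a precomputed series instead of re-aggregating raw data on every refresh. A sketch, with http_requests_total standing in for whatever metric your expensive panels actually aggregate:

```yaml
groups:
  - name: dashboard-precomputation
    rules:
      # Precompute the per-job request rate once per evaluation interval;
      # dashboards then query this cheap recorded series instead of the raw one
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```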

That sounds really useful! I'm currently running an Alloy + Prometheus + Loki + Thanos setup, and I've found that collecting all logs and setting retention per tag works well. Keeping metrics local and cheap is definitely the way to go.