Hey everyone! I'm currently running an RKEv2 cluster with 3 master nodes, 4 worker nodes, and around 240 containers. I'm looking to enhance my observability because we're facing some SIGTERM issues and sporadic database disconnections that are causing service interruptions.
Here are my requirements:
- Monthly budget capped at $100.
- I need something that can intelligently identify the root causes of issues.
- The setup and maintenance should be relatively easy.
- Strong alerting capabilities are a must.
- I'm currently using DataDog just for logging.
- Open to considering self-hosted options as well.
I really need a way to understand why these SIGTERM signals are happening without spending countless hours digging through logs and metrics. Any recommendations?
3 Answers
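Truthfully, monitoring tools alone might not pinpoint your exact SIGTERM issues. A SIGTERM usually means the kubelet is gracefully shutting down a container, for example after a failed liveness probe, a rollout, or a node drain, and a misconfiguration is often behind it. Keep in mind that a container exceeding its memory limit is OOM-killed with SIGKILL (exit code 137) rather than SIGTERM (exit code 143), so the exit code helps tell the two apart. Start with kubectl describe pod and kubectl get events to see why the pod was stopped, and don't forget to check that the kubelet and the container runtime use the same cgroup driver.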
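If it helps, here is a rough Python sketch of that first triage step. It assumes you have dumped the output of kubectl get pods -A -o json to a file and loaded it as a dict; the function name summarize_terminations is just something I made up for illustration. It only reads the standard containerStatuses fields (lastState.terminated, restartCount), so treat it as a starting point, not a root-cause tool.

```python
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
SIGNAL_EXIT_CODES = {143: "SIGTERM", 137: "SIGKILL"}


def summarize_terminations(pod_list):
    """Return one row per container whose previous run was terminated.

    `pod_list` is the parsed output of `kubectl get pods -A -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = (cs.get("lastState") or {}).get("terminated")
            if not terminated:
                continue  # no record of a previous terminated run
            exit_code = terminated.get("exitCode")
            rows.append({
                "namespace": meta.get("namespace"),
                "pod": meta.get("name"),
                "container": cs.get("name"),
                "restarts": cs.get("restartCount", 0),
                "reason": terminated.get("reason"),
                "exit_code": exit_code,
                "signal": SIGNAL_EXIT_CODES.get(exit_code, "-"),
                "finished_at": terminated.get("finishedAt"),
            })
    # Most frequently restarted containers first.
    return sorted(rows, key=lambda r: r["restarts"], reverse=True)
```

Containers that show up with exit code 143 are the ones to cross-check against the pod events. Rows with reason OOMKilled and exit code 137 point at memory limits instead.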
You could start off with Grafana along with Prometheus for monitoring. If issues persist, consider adding Loki for logs and Tempo for tracing. But honestly, if you can, steer clear of DataDog. It's a bit overkill and costly for what you're getting.
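Once Prometheus is up, a restart query is a cheap first signal. Below is a small Python sketch that builds an instant-query URL against the standard Prometheus HTTP API (/api/v1/query). It assumes kube-state-metrics is being scraped, since that is what exposes kube_pod_container_status_restarts_total; the helper name restart_query_url and the example base URL are my own placeholders.

```python
from urllib.parse import urlencode

# Assumption: kube-state-metrics is scraped, so this metric is available.
RESTART_QUERY = "increase(kube_pod_container_status_restarts_total[{window}]) > 0"


def restart_query_url(base_url, window="1h"):
    """Build an instant-query URL for containers that restarted within `window`."""
    promql = RESTART_QUERY.format(window=window)
    # urlencode escapes the brackets, spaces, and ">" in the PromQL expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical in-cluster address; replace with wherever your Prometheus listens.
url = restart_query_url("http://prometheus.example:9090", window="30m")
```

The same PromQL expression should also work as the condition of an alert rule, so you get notified about restarts instead of finding them after the fact.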
Have you looked into VictoriaMetrics and their logging options? They have a solid stack that works well. You would use VictoriaMetrics for metrics and pair it with Grafana for visualization. If you've got the time, check out their Helm charts for deployment.
Could you share some pointers on which Helm chart or setup you'd suggest? Is Grafana still part of the mix?