I'm new to Kubernetes and trying to figure out how to get good observability for my EKS cluster. I'm considering using OTLP to send metrics and logs directly from my applications, but I'm hesitant to run agents or collectors like Alloy or the OpenTelemetry Collector. I worry I might miss some pod logs, but I plan to push logs from the apps anyway. Right now I'm focusing on getting node and pod metrics set up, which means I'll need to deploy Prometheus and Grafana along with the necessary scrapers. The trouble is, there are so many ways to deploy them: the Prometheus Operator, kube-prometheus, and various Grafana charts. It's confusing how all these methods differ yet achieve similar ends. Why has the observability landscape become so complicated?
5 Answers
You can definitely set it all up using Helm without those agents, but you really should have monitoring in place. Just a tip: if you're building alerts for cluster health, set those up in a separate cluster, not the same one that's being monitored (like having a fire alarm outside your house instead of inside it). If the monitored cluster goes down, you want your alerts to still work. Also keep an eye on the volume of data; it can get huge fast, so using something like Thanos to downsample and compact older data into object storage will save you a lot in the long run.
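If you do go the Helm route, a minimal sketch looks roughly like this. The repo and chart are the prometheus-community `kube-prometheus-stack`; the retention numbers and password are placeholders you'd adjust for your cluster:

```bash
# Add the community repo and install kube-prometheus-stack
# (Prometheus Operator + Prometheus + Alertmanager + Grafana + node-exporter + kube-state-metrics).
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Illustrative values file; retention and sizing are placeholders.
cat > monitoring-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    retention: 15d          # keep raw data for 15 days
    retentionSize: 40GB     # drop oldest blocks once disk usage hits this cap
grafana:
  adminPassword: change-me  # use a proper secret for anything beyond a test cluster
EOF

helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f monitoring-values.yaml
```

That alone gets you node and pod metrics plus dashboards; Thanos would come later, once retention actually becomes a cost problem.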
I'm not diving into your main confusion, but just a heads up: the Helm chart most people mean is `kube-prometheus-stack`, which installs the Prometheus Operator together with its custom resources (ServiceMonitor, PodMonitor, PrometheusRule, etc.) plus Prometheus, Alertmanager, and Grafana. The similarly named `kube-prometheus` is the jsonnet-based project it builds on, which is a big part of the naming confusion. Understanding how these pieces fit together will really help clarify things for you. Most stacks these days seem to revolve around Grafana's LGTM stack or Prometheus combined with other tools like Fluentd or Jaeger.
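To make the "custom resources" part concrete, here's a rough sketch of a ServiceMonitor, the resource the operator watches to generate Prometheus scrape configs. The app name, namespace, and port name are made up for illustration:

```bash
# Hypothetical ServiceMonitor: tells the operator-managed Prometheus to scrape
# any Service labeled app=my-api on its "metrics" port every 30s.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api
  namespace: monitoring
  labels:
    release: monitoring     # must match the selector your Prometheus instance uses
spec:
  namespaceSelector:
    matchNames: [default]
  selector:
    matchLabels:
      app: my-api
  endpoints:
    - port: metrics         # named port on the Service
      interval: 30s
EOF
```

Once you see that the operator just turns these objects into scrape configuration, the different deployment methods stop looking so mysterious.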
Yeah, it's an interesting thought. Collectors can be helpful, especially for batching and retrying exports, which eases the load on your backend systems.
Absolutely agree with the point about managing volume; logs can pile up quickly! Just make sure you keep label sprawl in check when you're setting up your monitoring: every distinct label value creates a new series, so high-cardinality labels can blow up memory and storage down the line. It's not as scary once you get it set up properly, but the complexity can be a headache during troubleshooting!
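One way to keep cardinality down, assuming you're using the operator's ServiceMonitors, is to drop known high-cardinality labels at scrape time. A rough sketch, with hypothetical resource and label names:

```bash
# Illustrative patch for a ServiceMonitor endpoint: drop labels that explode
# cardinality (e.g. per-request or per-session IDs) before samples are stored.
kubectl patch servicemonitor my-api -n monitoring --type merge -p '
spec:
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabelings:
        - action: labeldrop
          regex: "request_id|session_id"   # hypothetical label names
'
```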
Don't stress too much! Just go with something like Alloy or the OpenTelemetry Collector alongside the LGTM stack: Loki for logs, Grafana for dashboards, Tempo for traces, and Mimir for metrics. You can get started easily using the k8s-monitoring-helm chart from Grafana's GitHub. It's straightforward! Once your cluster grows, you might start exploring operators and more complex setups.
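For reference, installing that chart is roughly the following; the repo URL and chart name come from Grafana's helm-charts repo, but the values schema differs between chart versions, so check the chart's README for the exact keys:

```bash
# Rough sketch: add Grafana's chart repo and install the k8s-monitoring chart,
# which bundles Alloy plus preconfigured collection of metrics, logs, and events.
# You still need a values file pointing at your Loki/Mimir/Tempo (or Grafana Cloud)
# endpoints; the exact keys depend on the chart version.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install k8s-monitoring grafana/k8s-monitoring \
  --namespace monitoring --create-namespace \
  -f k8s-monitoring-values.yaml   # your cluster name and destination endpoints go here
```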
Totally agree! Adding Grafana Beyla can help with eBPF-based auto-instrumentation across languages, and you don't need to change the app at all. It's great for legacy systems.
For a no-fuss option, consider Grafana Cloud; it's affordable and gives you the Kubernetes integration quickly without the headache of self-hosting. Just be aware that self-managing Prometheus gets tricky as retention and scale grow.
A full Grafana stack sounds like a solid plan. I'd recommend keeping components like Mimir, Loki, and Tempo deployed separately for better control. Is it worth putting a collector in between, like Alloy or Fluent Bit? What do you think?
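For what it's worth, the "collector in between" usually amounts to something like this minimal OpenTelemetry Collector pipeline; the Mimir and Loki endpoints below are placeholders for wherever your backends actually live:

```bash
# Minimal OpenTelemetry Collector config sketch: apps push OTLP to the collector,
# which batches and forwards metrics to Mimir (Prometheus remote write) and logs
# to Loki's OTLP ingestion endpoint. Both backend URLs are placeholders.
cat > otel-collector-config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:8080/api/v1/push   # placeholder
  otlphttp/loki:
    endpoint: http://loki.monitoring.svc:3100/otlp           # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
EOF
```

The win is that apps only ever talk OTLP to one local endpoint, and the collector handles batching, retries, and fan-out to the separately deployed backends.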