I'm setting up an internal load balancer (with external-dns as a nice bonus) for several Kubernetes clusters so that my central Thanos can scrape metrics from them. I want to stick to Kubernetes-native solutions and avoid relying on cloud infrastructure. Do you think implementing a service mesh would be excessive for this purpose?
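For reference, the load-balancer piece I have in mind looks roughly like this. It's just a sketch: it assumes an on-cluster load-balancer implementation such as MetalLB plus external-dns are already running, and it assumes the thing being exposed is a Thanos sidecar's StoreAPI on port 10901; all the names and the hostname are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-sidecar-grpc
  namespace: monitoring
  annotations:
    # external-dns watches this annotation and publishes the record in my zone
    external-dns.alpha.kubernetes.io/hostname: thanos.cluster-a.example.internal
spec:
  # with MetalLB (or similar) this stays on my own network, no cloud LB involved
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: prometheus
  ports:
    - name: grpc
      port: 10901
      targetPort: 10901
```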
4 Answers
We actually run this exact setup: our clusters are fully independent but connected through Thanos over Istio. The nice part is that if Istio ever goes down, each cluster keeps operating on its own and nothing is lost, since every Prometheus still holds its own data locally. That's an advantage this approach has over remote-write: a temporary Istio outage only affects global querying and doesn't have to raise any alarms. Just make sure you design it carefully, because Istio can be tricky and will fail on you if it isn't handled well.
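To give a rough idea of the shape (this is a generic sketch rather than our actual config, and every name, host, and port in it is a placeholder): exposing a cluster's Thanos StoreAPI through an Istio ingress gateway looks something like this.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: thanos-gateway
  namespace: monitoring
spec:
  selector:
    istio: ingressgateway        # route through the default ingress gateway pods
  servers:
    - port:
        number: 10901
        name: grpc-thanos
        protocol: GRPC
      hosts:
        - thanos.cluster-a.example.internal
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: thanos-sidecar
  namespace: monitoring
spec:
  hosts:
    - thanos.cluster-a.example.internal
  gateways:
    - thanos-gateway
  http:
    - route:
        - destination:
            # placeholder Service fronting the Prometheus pod's sidecar container
            host: prometheus-thanos-sidecar.monitoring.svc.cluster.local
            port:
              number: 10901
```

The central Thanos Query then dials each cluster's gateway address, so a cluster that drops off the mesh simply disappears from the global view until it comes back.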
In my opinion, a service mesh might be overkill here. I'd run the Thanos sidecar next to each Prometheus instance instead. The only real requirement is that the central Thanos Query can reach each leaf's sidecar to query the last couple of hours of metrics, since that's all the sidecar serves from the local TSDB; something like the sketch below.
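Purely as a sketch (container layout, image versions, and names here are examples, not anyone's actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-with-sidecar
  namespace: monitoring
spec:
  containers:
    - name: prometheus
      image: prom/prometheus:v2.53.0
      args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
      volumeMounts:
        - name: data
          mountPath: /prometheus
    - name: thanos-sidecar
      image: quay.io/thanos/thanos:v0.35.1
      args:
        - sidecar
        - --prometheus.url=http://localhost:9090
        - --tsdb.path=/prometheus
        # StoreAPI endpoint the central Thanos Query connects to
        - --grpc-address=0.0.0.0:10901
      volumeMounts:
        - name: data
          mountPath: /prometheus
  volumes:
    - name: data
      emptyDir: {}
```

On the central side you just point Thanos Query at each cluster's exposed sidecar with one `--endpoint=<host>:10901` flag per cluster (older versions call it `--store`).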
Another option to consider is using Tailscale instead of a full service mesh.
Have you considered Prometheus remote write? Thanos supports it, so instead of having Thanos pull metrics from your clusters, the clusters could push their metrics to Thanos directly. Just bear in mind that you may need to run a lightweight collector like Grafana Alloy in each cluster for this.
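As a minimal sketch of the push side (the host here is made up, but `/api/v1/receive` is the path Thanos Receive listens on for remote write): in prometheus.yml it would look like this, and Grafana Alloy's `prometheus.remote_write` component takes the same URL if you'd rather run Alloy than a full Prometheus in each cluster.

```yaml
remote_write:
  - url: https://thanos-receive.central.example.internal/api/v1/receive
    headers:
      # Receive's default tenant header; handy if you want per-cluster tenancy
      THANOS-TENANT: cluster-a
```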
That's right! You'd use the Thanos Receive component as the target for the remote writes. Keep in mind that it can be quite memory-hungry, though.
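For a rough idea of what that end looks like, here's just the container portion of a deployment for it; the flags are standard Thanos Receive flags, but the names, ports, and sizing are illustrative only.

```yaml
containers:
  - name: thanos-receive
    image: quay.io/thanos/thanos:v0.35.1
    args:
      - receive
      - --tsdb.path=/var/thanos/receive
      - --remote-write.address=0.0.0.0:19291   # where the clusters' remote_write lands
      - --grpc-address=0.0.0.0:10901           # StoreAPI the central querier reads from
      - --label=receive_replica="0"
    resources:
      requests:
        memory: 4Gi   # it really is memory-hungry; size this to your ingest rate
```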
I haven’t faced any major issues with Istio going down. There was one incident when the istiod pod went down due to user error, but the gateways continued to function without a hitch.