I've noticed that a big chunk of my GKE bill, around 30%, comes from inter-zone data transfer. My project relies heavily on internal traffic, which can add up to hundreds of terabytes exchanged per month. By default, my cluster's nodes are spread across all the zones in the region. I tried to save costs by consolidating all nodes in a single zone, but I'm concerned that this compromises availability. I'm looking for a way to keep a multi-AZ setup for reliability while minimizing inter-AZ communication costs. I know one workaround is to set up a separate application stack for each AZ and load-balance across them, but that feels overly complicated. Is there a simpler way to encourage services to talk to local (same-zone) instances within Kubernetes?
3 Answers
Have you considered using topology-aware routing? It lets Kubernetes prefer Service endpoints in the same zone as the calling pod, which could cut down your cross-zone traffic.
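For illustration, here is a minimal sketch of what opting a Service into topology-aware routing could look like. It builds the manifest in Python and prints it as JSON, which `kubectl` accepts. The Service name `backend`, the `app: backend` selector, and port 8080 are hypothetical placeholders. It also assumes a cluster recent enough to recognize the `service.kubernetes.io/topology-mode` annotation; older versions used `service.kubernetes.io/topology-aware-hints: auto` instead, so check what your GKE version supports.

```python
import json


def zone_aware_service(name, app_label, port):
    """Build a Service manifest that opts in to topology-aware routing.

    The name, label, and port are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Asks Kubernetes to prefer endpoints in the caller's zone.
                # This is a hint, not a guarantee.
                "service.kubernetes.io/topology-mode": "Auto",
            },
        },
        "spec": {
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


manifest = zone_aware_service("backend", "backend", 8080)
print(json.dumps(manifest, indent=2))
```

You could save the script under any name and pipe its output into `kubectl apply -f -`. Keep in mind that the routing is best-effort: if a zone has too few endpoints for its share of traffic, Kubernetes may fall back to cluster-wide routing, so it tends to work best when each zone runs enough replicas of the target service.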
We made the shift to operate in just one AZ for processing while using multi-AZ storage on S3. It has substantially lowered our costs. Consider how many AZ outages have happened in the last few years—you might be surprised at how reliable they are. Does it really make sense to spend 30% of your budget to mitigate just a small risk of downtime each year?
That’s what I thought when I went for a single AZ!
I spent 7 years in AWS using single AZ setups—never faced issues that a quick restart couldn’t fix. In my opinion, the savings are worth it, especially since the likelihood of needing redundancy seems low.
There's no one-size-fits-all answer, but you might want to look into the `preferredDuringSchedulingIgnoredDuringExecution` node affinity rule. With it, the scheduler prefers nodes in a single AZ while you still keep some nodes in another. That way, if the preferred AZ has a problem, replacement pods can be scheduled in the other AZ automatically. Note that pods that are already running are not moved just because the preference is no longer met. Also be cautious: if you have stateful workloads, this won't completely solve your data transfer issue, since you would still have to sync data across AZs.
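As a rough sketch of that affinity rule, the snippet below builds a Deployment manifest in Python and prints it as JSON. The zone `us-central1-a`, the Deployment name, the replica count, and the image are all hypothetical placeholders. It relies on the standard `topology.kubernetes.io/zone` node label.

```python
import json


def prefer_zone_affinity(zone, weight=100):
    """Return a soft node affinity that prefers nodes in the given zone."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "nodeAffinity": {
            # Soft rule: the scheduler prefers this zone but may use others.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "topology.kubernetes.io/zone",
                                "operator": "In",
                                "values": [zone],
                            }
                        ]
                    },
                }
            ]
        }
    }


# Hypothetical workload: name, replica count, image, and zone are placeholders.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "backend"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "backend"}},
        "template": {
            "metadata": {"labels": {"app": "backend"}},
            "spec": {
                "affinity": prefer_zone_affinity("us-central1-a"),
                "containers": [
                    {"name": "backend", "image": "registry.example.com/backend:1.0"}
                ],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

Because the rule is only a preference, the scheduler can still place pods in other zones when the preferred one is full or unavailable. That is what keeps the fallback path open, but it also means some cross-zone traffic can reappear under pressure.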
Another idea is to structure your database to minimize cross-node data traffic. For instance, doing joins locally or replicating smaller tables across AZs can help. Just remember that the 30% expense is real: you can optimize it, but some of that cost is likely to stay with you.
Not yet, but it sounds like it could be a good solution!