I'm looking for advice from a FinOps perspective regarding our Kubernetes cluster. I've noticed that many of our nodes are only utilizing about 20-30% of their capacity, which seems like a good opportunity to consolidate and lower our node count. However, the DevOps team tells me that some pods are effectively unevictable, which prevents us from draining those nodes. The reasons behind this include pod disruption budgets, local storage requirements, strict affinities, and sometimes just a lack of alternative nodes that can host these pods. So while it seems like we have idle nodes, they're actually kept alive by one or two pods. I understand the hesitation from the DevOps side, but it's frustrating from a financial perspective to see our capacity committed to these underutilized nodes. What strategies do teams usually implement to address this issue? How can I propose a solution to the DevOps team without coming off as overly simplistic, like merely suggesting they move the pods?
5 Answers
Your DevOps team has valid reasons for their stance. For example, they might have topology constraints and anti-affinity rules to ensure maximum uptime during outages. Keeping certain pods on specific nodes is often essential for stability, even if those nodes could technically be scaled down.
One approach is to identify and isolate the workloads causing the issue. If you give them their own pool of smaller nodes, each pinned pod holds up less idle capacity. If these pods are known and manageable, you can taint that pool and give the pods matching tolerations and node affinity, so they land there and nothing else does.
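A rough way to start the identification step is to list the nodes that are both lightly used and held by at least one pod that can't be evicted. The sketch below works on a made-up inventory. The pod names, node names, field names, and the 30% threshold are all assumptions for illustration, and in practice the data would come from your cluster or cost tooling.

```python
from collections import defaultdict

# Hypothetical inventory: one record per pod. "blocker" is the reason the pod
# cannot be evicted (None if it can move freely). All values are made up.
PODS = [
    {"name": "api-1", "node": "node-a", "cpu_request": 0.5, "blocker": None},
    {"name": "cache-0", "node": "node-a", "cpu_request": 1.0, "blocker": "local-storage"},
    {"name": "web-1", "node": "node-b", "cpu_request": 4.0, "blocker": None},
    {"name": "worker-1", "node": "node-b", "cpu_request": 2.0, "blocker": None},
    {"name": "db-0", "node": "node-c", "cpu_request": 2.0, "blocker": "pdb"},
]

# Hypothetical allocatable CPU per node.
NODE_CPU = {"node-a": 8.0, "node-b": 8.0, "node-c": 8.0}


def summarize(pods, node_cpu):
    """Return per-node requested-CPU utilization and the pods pinning the node."""
    requested = defaultdict(float)
    blockers = defaultdict(list)
    for pod in pods:
        requested[pod["node"]] += pod["cpu_request"]
        if pod["blocker"]:
            blockers[pod["node"]].append((pod["name"], pod["blocker"]))
    return {
        node: {
            "utilization": requested[node] / capacity,
            "blockers": blockers[node],
        }
        for node, capacity in node_cpu.items()
    }


def pinned_idle_nodes(report, threshold=0.3):
    """Nodes at or below the utilization threshold that hold a blocking pod."""
    return sorted(
        node
        for node, info in report.items()
        if info["utilization"] <= threshold and info["blockers"]
    )


report = summarize(PODS, NODE_CPU)
for node in pinned_idle_nodes(report):
    print(node, report[node]["blockers"])
```

A list like this gives the DevOps team something concrete to react to: each row is one node, one or two named pods, and the reason they can't move.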
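Tools like Descheduler and Karpenter can help here. Descheduler evicts pods that are poorly placed so the scheduler can put them somewhere better, and Karpenter consolidates underutilized nodes. Both respect pod disruption budgets and affinities, so they work within those constraints rather than around them, and pods that truly can't move will still need one of the other approaches.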
It's important to weigh your options. Which costs more: handling occasional refunds when the system can't scale up in time, or keeping some extra infrastructure running? Yes, you can optimize node packing, but the cluster autoscaler will add nodes again as demand requires. Also consider how efficiently the code in those pods runs; if it requests or uses more resources than necessary, that could be part of the issue.
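To put numbers on that trade-off, a back-of-the-envelope comparison is usually enough. This is only a sketch: the node price, incident rate, and cost per incident below are invented placeholders, and you would plug in your own figures.

```python
HOURS_PER_MONTH = 730  # common approximation for billing estimates


def monthly_idle_cost(pinned_nodes, node_hourly_cost, hours=HOURS_PER_MONTH):
    """Cost of keeping the mostly idle nodes running for a month."""
    return pinned_nodes * node_hourly_cost * hours


def monthly_risk_cost(incidents_per_month, cost_per_incident):
    """Expected monthly cost of capacity incidents (refunds, SLA credits)."""
    return incidents_per_month * cost_per_incident


# Illustrative numbers only, not taken from any real cluster.
idle = monthly_idle_cost(pinned_nodes=6, node_hourly_cost=0.5)
risk = monthly_risk_cost(incidents_per_month=0.5, cost_per_incident=5000)

if idle > risk:
    print(f"Idle capacity costs more ({idle:.0f} vs {risk:.0f}): consolidation is worth pursuing")
else:
    print(f"Expected incident cost is higher ({risk:.0f} vs {idle:.0f}): the headroom may be cheap insurance")
```

If the idle cost comes out well below the expected incident cost, that is a fair argument for leaving the nodes alone, and it is a comparison both FinOps and DevOps can agree on.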
Consider setting up a dedicated pool of nodes that won't be scaled down. If the DevOps team can't evict pods, they should be using a node selector to ensure those pods run in this special pool.
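As a sketch of what that looks like on the pod side, the snippet below adds a node selector and a matching toleration to a pod spec represented as a plain dictionary. The `nodeSelector` and `tolerations` fields are standard pod spec fields, but the label and taint key `pool`, the value `pinned`, and the helper name are assumptions for illustration. It also assumes the pool's nodes carry the label `pool=pinned` and the taint `pool=pinned:NoSchedule`.

```python
import copy
import json


def pin_to_pool(pod_spec, pool, key="pool"):
    """Return a copy of pod_spec that targets the dedicated pool.

    The key and pool names are hypothetical; they must match the label and
    taint actually applied to the nodes in that pool.
    """
    spec = copy.deepcopy(pod_spec)
    # Node selector: only schedule onto nodes labeled key=pool.
    spec.setdefault("nodeSelector", {})[key] = pool
    # Toleration: allow scheduling despite the pool's NoSchedule taint.
    spec.setdefault("tolerations", []).append(
        {"key": key, "operator": "Equal", "value": pool, "effect": "NoSchedule"}
    )
    return spec


# Minimal illustrative pod spec; the container name and image are made up.
original = {"containers": [{"name": "cache", "image": "example/cache:1.0"}]}
pinned = pin_to_pool(original, "pinned")
print(json.dumps(pinned, indent=2))
```

The selector keeps the unevictable pods in the pool, and the taint keeps everything else out, so the rest of the cluster can be drained and consolidated freely.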
