From a FinOps perspective, I'm seeing that many nodes in our cluster are only utilizing about 20-30% of their capacity. While I see this as an opportunity to consolidate resources and reduce the number of nodes, my discussions with the DevOps team reveal that certain pods are effectively unevictable. Factors like pod disruption budgets, local storage, strict affinities, or simply the absence of other nodes capable of hosting those pods contribute to this issue. So, despite having idle nodes, one or two pods keep them active. I understand their hesitance to make changes, but it's frustrating to see committed capacity tied up in these barely-used nodes. What strategies can teams implement to manage these unevictable pods so that we can eventually consolidate nodes? I'm looking for constructive proposals to discuss with the DevOps lead that go beyond simply moving the pods around.
3 Answers
When considering costs, weigh the impact of downtime against the expense of keeping a little extra infrastructure around. Yes, you can pack nodes more tightly, and if Kubernetes later needs more capacity the cluster autoscaler will happily add nodes back. The efficiency of the code running in the pods also matters: make sure memory and CPU usage (and the requests that reserve it) are as lean as possible before making sweeping changes.
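If it helps the conversation, here is a minimal sketch of what right-sizing looks like in a pod spec. The names and numbers are purely illustrative; the point is that requests should come from observed usage, not guesses:

```yaml
# Sketch: trimming padded requests so the scheduler can bin-pack more tightly.
# Values are hypothetical; base them on observed usage (e.g. kubectl top / your metrics).
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: example.com/app:latest
      resources:
        requests:
          cpu: 200m        # illustrative: observed p95 usage well below the old request
          memory: 256Mi
        limits:
          memory: 512Mi    # keep some headroom for spikes
```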
It's a good move to identify those workloads and potentially create a smaller node pool for them to minimize wasted capacity. If those pods are known and controllable, schedule them with taints, tolerations, and node affinity so they land on the right nodes instead of keeping a large node awake.
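As a rough sketch of the workload side, assuming a hypothetical small node pool labelled `node-pool=small-pool` and tainted `dedicated=small-pool:NoSchedule`:

```yaml
# Sketch: pin a known, small workload to a dedicated small node pool.
# Pool label, taint key, and image are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sticky-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sticky-workload
  template:
    metadata:
      labels:
        app: sticky-workload
    spec:
      # Toleration lets the pod land on nodes tainted dedicated=small-pool:NoSchedule
      tolerations:
        - key: dedicated
          operator: Equal
          value: small-pool
          effect: NoSchedule
      # Node affinity keeps it off the general-purpose (larger) nodes
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-pool
                    operator: In
                    values:
                      - small-pool
      containers:
        - name: app
          image: example.com/sticky-workload:latest
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```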
Your DevOps team is raising valid points. Uptime has real value, especially where topology spread constraints and anti-affinity rules are in play. Sometimes keeping those pods spread across a few lightly used nodes is preferable to reducing the node count just for the sake of it.
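For context in that discussion, this is the kind of hard anti-affinity that typically makes a pod effectively unevictable: with the `required...` form the scheduler will never co-locate the replicas, so the extra nodes they occupy cannot be drained. The names here are hypothetical, and whether softening it to `preferredDuringSchedulingIgnoredDuringExecution` is acceptable is exactly the trade-off to raise with the team:

```yaml
# Sketch: hard anti-affinity that keeps replicas on separate nodes.
# App name and image are placeholders for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: critical-api
  template:
    metadata:
      labels:
        app: critical-api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: critical-api
              topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: example.com/critical-api:latest
```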

Just be cautious with node sizes; in certain clouds, like GCP, disk throughput is linked to VM size. So if a smaller workload still needs fast disk performance, you may still require larger VMs.