I'm seeking real-world insights on managing node pool scale-down operations in Oracle Kubernetes Engine (OKE). My objective is to perform a zero-downtime Kubernetes upgrade while minimizing the risks associated with node termination.
Currently, I have a node pool with 3 nodes, and I'm planning to scale it up to 6 nodes. The strategy I've mapped out includes:
- Scale the node pool from 3 to 6 nodes so workloads can reschedule onto the new nodes.
- Cordon and drain the old nodes.
- Scale the pool back down from 6 to 3 nodes.
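For concreteness, the steps above can be sketched roughly as follows. This is a hedged outline, not a tested procedure: the node pool OCID and node names are placeholders, and the `oci ce node-pool update --size` flag should be verified against your installed OCI CLI version.

```shell
# 1) Scale the pool up from 3 to 6 (OCID is a placeholder)
oci ce node-pool update \
  --node-pool-id ocid1.nodepool.oc1..exampleuniqueID \
  --size 6

# 2) Once the new nodes are Ready, cordon and drain each old node
#    (node names are placeholders; adjust flags to your workloads)
for node in old-node-1 old-node-2 old-node-3; do
  kubectl cordon "$node"
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=5m
done

# 3) Scale the pool back down to 3
oci ce node-pool update \
  --node-pool-id ocid1.nodepool.oc1..exampleuniqueID \
  --size 3
```

The open question, of course, is step 3: which three nodes OKE actually removes.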
However, I have concerns about the scale-down behavior in OKE. In AWS EKS, I know that the oldest instances are terminated first, but I couldn't find any documentation for OKE that confirms the order of node removal when scaling down a node pool.
My questions are:
1. Is there any documented behavior regarding the order of node termination during scale-down in OKE?
2. Does cordoning or draining old nodes affect which nodes OKE decides to remove?
I'm looking for any insights or best practices from those who've handled this in production OKE clusters. Thanks!
2 Answers
It sounds like you're on the right track! One clarification: cordoning a node only prevents new pods from being scheduled onto it; it's the drain step that evicts the existing pods so they reschedule onto the new, uncordoned nodes. If your workloads run multiple replicas and have sensible PodDisruptionBudgets, the evictions should proceed gracefully and you shouldn't see downtime.
As for scale-down order: unfortunately, OKE doesn't document which nodes get terminated first the way EKS does. In practice it often appears to remove cordoned nodes first, but that behavior isn't guaranteed, so keep a close eye on your workloads during the scale-down.
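Before and after the drain, it's worth verifying that evictions can actually proceed and that nothing is left stranded. A quick sanity check along these lines (node name is a placeholder):

```shell
# Check PodDisruptionBudgets first: a PDB with zero allowed
# disruptions will block (or stall) kubectl drain
kubectl get pdb --all-namespaces

# After draining, confirm only DaemonSet pods remain on the old node
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=old-node-1
```

If the second command still lists non-DaemonSet pods, the drain didn't complete and scaling down would cause disruption.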
You're considering a common approach! One caveat: when you scale the pool from 6 back to 3, OKE chooses which three nodes to remove, and there's no guarantee it will pick the drained ones. If the old nodes survive the scale-down, they'll linger cordoned and empty, serving no workloads but still counting against your node limits and configurations. So verify which nodes were actually removed, and clean up any drained stragglers once everything is running smoothly on the new nodes.
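Rather than relying on undocumented scale-down ordering, OKE's DeleteNode operation lets you remove a specific node and shrink the pool in the same call, which sidesteps the problem entirely. A sketch, with placeholder OCIDs; the exact flag names (notably `--is-decrement-size`) should be confirmed against your OCI CLI version:

```shell
# Delete a specific worker node and decrement the pool size by one,
# so OKE doesn't replace it (OCIDs are placeholders)
oci ce node-pool delete-node \
  --node-pool-id ocid1.nodepool.oc1..exampleuniqueID \
  --node-id ocid1.instance.oc1..exampleuniqueID \
  --is-decrement-size true
```

Running this once per drained node takes the pool from 6 back to 3 while guaranteeing it's the old nodes that go.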
