How can I find out which pods are blocking my EKS cluster upgrade?

Asked By TechieGuru99 On

I'm trying to upgrade my EKS cluster from Kubernetes version 1.31 to 1.32, but I'm stuck because the upgrade for my managed node group's worker nodes isn't happening. I've set up the terraform-aws-eks module at version 20.36.0 and used `cluster_force_update_version = true`, but it's not forcing the upgrade as the documentation suggests for situations involving `podEvictionError`.

The control plane upgrade to 1.32 went smoothly, so I'm puzzled about which pods might be causing this `podEvictionError`. To help troubleshoot, I moved all my workloads using EBS-backed PVCs into a single-AZ managed node group to avoid potential scheduling conflicts. My longest `terminationGracePeriodSeconds` is on Flux (10 minutes), while the ingress controllers have 5 minutes, and my upgrade attempt times out after 30 minutes. I'm also using the default PodDisruptionBudgets from various Helm charts for components like kube-prometheus-stack, cluster-autoscaler, nginx, cert-manager, etc.

What steps can I take to identify the pods that are causing this upgrade block or resolve the issue? Thanks for any insights!

3 Answers

Answered By DevOpsDude88 On

First, check whether the pods that fail to evict have a PodDisruptionBudget (PDB) associated with them. If they do, you may need to relax the PDB (for example its `minAvailable` or `maxUnavailable` setting) so that evictions are permitted, or temporarily remove those budgets to see if that unblocks the upgrade.
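One way to narrow this down is to look at each PDB's `status.disruptionsAllowed`: when it is 0, eviction requests for the pods it covers are refused. The sketch below is only an illustration, not an official tool. It assumes you have saved the output of `kubectl get pdb -A -o json` to a file, and the script name and the `blocking_pdbs` helper are made up for this example.

```python
import json
import sys


def blocking_pdbs(pdb_list):
    """Return (namespace, name, currentHealthy, desiredHealthy) for every
    PDB that currently allows zero disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        # A missing status is treated as 0 allowed disruptions (assumption:
        # a PDB the controller has not processed yet should still be reviewed).
        if status.get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append((
                meta.get("namespace"),
                meta["name"],
                status.get("currentHealthy"),
                status.get("desiredHealthy"),
            ))
    return blocked


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   kubectl get pdb -A -o json > pdbs.json
    #   python find_blocking_pdbs.py pdbs.json
    with open(sys.argv[1]) as fh:
        for ns, name, cur, want in blocking_pdbs(json.load(fh)):
            print(f"{ns}/{name}: currentHealthy={cur} desiredHealthy={want}")
```

A PDB that shows up here with `currentHealthy` equal to `desiredHealthy` is a likely candidate, for instance a single-replica Deployment whose chart sets `minAvailable: 1`.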

Answered By CloudNinja77 On

Have you looked for stuck PVCs or finalizers? Sometimes those can cause the upgrade to hang. I noticed that Loki can leave PVCs lingering even after you destroy it with Terraform. That might be part of the issue you're facing.
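A rough way to check for this: an object that has a `metadata.deletionTimestamp` but still lists `metadata.finalizers` has been asked to delete and is waiting on something. The following is a minimal sketch under the assumption that you have dumped `kubectl get pvc,pods -A -o json` to a file. The `stuck_on_finalizers` name and the script name are hypothetical.

```python
import json
import sys


def stuck_on_finalizers(obj_list):
    """Return (kind, namespace, name, finalizers) for objects that are marked
    for deletion but still carry finalizers."""
    stuck = []
    for item in obj_list.get("items", []):
        meta = item.get("metadata", {})
        # deletionTimestamp is only set once deletion has been requested;
        # remaining finalizers are what keep the object around.
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            stuck.append((
                item.get("kind"),
                meta.get("namespace"),
                meta.get("name"),
                list(meta["finalizers"]),
            ))
    return stuck


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   kubectl get pvc,pods -A -o json > objects.json
    #   python find_stuck_objects.py objects.json
    with open(sys.argv[1]) as fh:
        for kind, ns, name, finalizers in stuck_on_finalizers(json.load(fh)):
            print(f"{kind} {ns}/{name} waiting on {', '.join(finalizers)}")
```

Note that a healthy PVC normally carries the `kubernetes.io/pvc-protection` finalizer, so the finalizer alone is not a problem; it only matters once a deletion timestamp is present as well.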

Answered By K8sMaster101 On

Do you have Calico installed? I found during my own 1.32 upgrade that the tigera-operator has tolerations for both NoExecute and NoSchedule. It kept getting scheduled onto the node meant for replacement, which led to multiple upgrade failures before I caught onto it.
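To look for other pods with the same behaviour, you could scan for tolerations that use `operator: Exists` without a `key`, since those match every taint of the given effect (or every taint at all if no effect is set). This is a sketch only, assuming a saved `kubectl get pods -A -o json` dump. The function name and script name are invented for illustration, and DaemonSet-owned pods are skipped on the assumption that the drain ignores them anyway.

```python
import json
import sys


def pods_with_wildcard_tolerations(pod_list):
    """Return (namespace, name, effects) for non-DaemonSet pods that tolerate
    any taint via a key-less `operator: Exists` toleration."""
    found = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        owners = meta.get("ownerReferences") or []
        if any(o.get("kind") == "DaemonSet" for o in owners):
            continue  # assumption: DaemonSet pods are not evicted during drain
        effects = set()
        for tol in pod.get("spec", {}).get("tolerations") or []:
            if tol.get("operator") == "Exists" and not tol.get("key"):
                # "*" marks a toleration with no effect, which matches all effects
                effects.add(tol.get("effect") or "*")
        if effects:
            found.append((meta.get("namespace"), meta.get("name"), sorted(effects)))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   kubectl get pods -A -o json > pods.json
    #   python find_wildcard_tolerations.py pods.json
    with open(sys.argv[1]) as fh:
        for ns, name, effects in pods_with_wildcard_tolerations(json.load(fh)):
            print(f"{ns}/{name} tolerates all taints with effect: {', '.join(effects)}")
```

Any Deployment-managed pod listed with `NoSchedule` can be rescheduled onto a node that has been tainted for replacement, which matches the behaviour described above.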
