Trouble with Stale Endpoints After Upgrading EKS to 1.33

Asked By TechieWizard99

After upgrading our EKS cluster from 1.32 to 1.33 on March 7, 2026, we started seeing widespread service timeouts even though all pods reported healthy. We suspect stale Endpoints objects: deleting them resolved the timeouts, presumably because the endpoints controller rebuilt them from the current pod IPs.

During the upgrade the kube-controller-manager briefly restarted, and at the same time we rolled out a new node AMI, which replaced every node in the cluster and gave pods new IPs. Shortly afterwards, internal services such as argocd-repo-server and argo-redis began timing out.

Our questions: Were stale Endpoints really the cause? Could the kube-controller-manager have missed pod-IP change events while it was restarting? Is there a recommended ordering for control-plane and node upgrades to avoid this, and how can we ensure stale Endpoints are detected and reconciled automatically rather than lingering?
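For context, a check like the following shows the kind of mismatch we mean, diffing the addresses in the Endpoints object against the pods currently backing the Service. We're assuming the standard argocd namespace and app.kubernetes.io/name label here; adjust both to match your install:

```sh
# IPs the Endpoints object currently advertises
kubectl get endpoints argocd-repo-server -n argocd \
  -o jsonpath='{.subsets[*].addresses[*].ip}'; echo

# IPs of the pods actually backing the Service
kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-repo-server \
  -o jsonpath='{.items[*].status.podIP}'; echo

# If the two sets differ, the Endpoints object is stale. Deleting it
# forces the endpoints controller to rebuild it from the live pod set.
kubectl delete endpoints argocd-repo-server -n argocd
```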

3 Answers

Answered By CloudNinja42

That sounds pretty rough! Have you looked into your CoreDNS monitoring? It might be that CoreDNS struggled to keep up during the node replacements. If you're not using any auto-scaling or custom configs for CoreDNS, it could be getting overwhelmed. Also, check the logs for anything unusual around the time the issues began.
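If you want to pull those logs, something like this is a reasonable starting point (k8s-app=kube-dns is the default CoreDNS label on EKS; adjust if you've customized the deployment):

```sh
# Tail recent logs across all CoreDNS pods
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200

# Optionally turn on per-query logging by adding the `log` plugin
# to the Corefile in the coredns ConfigMap (noisy, so revert after)
kubectl -n kube-system edit configmap coredns
```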

Answered By DockerDynamo56

Wow, that's timely for me too! I'm about to upgrade to 1.33 as well. Good luck getting everything back in line. Let us know what AWS says!

Answered By KubeGuru77

Did you manage to check the CoreDNS logs from that window? They can often hint at what was going wrong, and if nothing was logged at all during the incident, that's a red flag in itself!
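One more thing: on EKS the kube-controller-manager runs on the managed control plane, so its logs from the restart window only exist if control plane logging was already enabled. Roughly, with <cluster-name> and <region> as placeholders:

```sh
# Enable controllerManager logs (this only captures events going forward,
# so it won't recover the restart window if it wasn't already on)
aws eks update-cluster-config --name <cluster-name> --region <region> \
  --logging '{"clusterLogging":[{"types":["controllerManager"],"enabled":true}]}'

# Logs land in CloudWatch under /aws/eks/<cluster-name>/cluster
aws logs filter-log-events \
  --log-group-name /aws/eks/<cluster-name>/cluster \
  --filter-pattern "endpoint"
```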
