Help! We’re Facing Stale Endpoints After Our EKS Upgrade

Asked By TechExplorer123 On

We recently upgraded our EKS cluster from version 1.32 to 1.33 on March 7, 2026, and updated the AMIs for all nodes at the same time. Post-upgrade, we started seeing significant service timeouts even though all our pods looked healthy. After some troubleshooting, we found that deleting the Endpoints objects fixed the issue, so stale Endpoints may have been the cause, and we're reaching out to the AWS EKS team for clarification.

During the upgrade, the kube-controller-manager briefly restarted, and the AMI update caused full node replacement and new pod assignments. If the Endpoints weren't reconciled while the controller was restarting, they may still have been pointing at pod IPs from the replaced nodes. We've also seen the issue recur regularly in production: deleting a CoreDNS pod causes internal services to time out again. We need to understand whether stale Endpoints were really the root cause and how we can prevent this in the future.
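For anyone hitting something similar, here's a rough way to confirm staleness before deleting anything: compare the IPs listed in a Service's Endpoints object against the pod IPs the Service's selector actually matches. The namespace, service name, and `app=my-service` label below are placeholders; substitute your own.

```shell
# Placeholders; substitute your own namespace, service, and selector.
NS=default
SVC=my-service

# IPs currently listed in the Endpoints object
kubectl -n "$NS" get endpoints "$SVC" \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sort > endpoint-ips.txt

# IPs of the pods the Service's selector actually matches
kubectl -n "$NS" get pods -l app=my-service \
  -o jsonpath='{.items[*].status.podIP}' | tr ' ' '\n' | sort > pod-ips.txt

# Any difference here suggests the Endpoints object is stale
diff endpoint-ips.txt pod-ips.txt
```

If `diff` shows IPs present in the Endpoints object but absent from the pod list, the controller hasn't reconciled that Service since the node replacement.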

3 Answers

Answered By SysAdminSamantha On

Make sure your control plane logs are enabled, as they'll give you much better visibility into what happened during those updates. Definitely submit a support case if you haven't yet; AWS might have specific advice or know of bugs in 1.33 affecting Endpoints reconciliation.
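If logging isn't on yet, you can enable it from the CLI. A minimal sketch, assuming your cluster is named `my-cluster` (the `controllerManager` log type is the one most relevant to Endpoints reconciliation):

```shell
# Enable api, controllerManager, and scheduler control plane logs;
# "my-cluster" is a placeholder for your cluster name.
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","controllerManager","scheduler"],"enabled":true}]}'
```

Logs then land in CloudWatch under the cluster's log group, where you can search them after the next incident.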

Answered By DevOpsDude77 On

It's interesting that deleting a CoreDNS pod causes cascading timeouts! It might be worth checking for admission webhooks that could be interfering with the reconciliation of your Endpoints objects. That happened to us once and took forever to troubleshoot. Hope you get it all sorted out soon!

TechExplorer123 -

Thanks! I’ll keep that in mind. It’s great to hear from someone who faced similar challenges.

Answered By CloudGuru99 On

It sounds like you're running into issues with the Kubernetes Endpoints resource, which Services depend on to route traffic to pod IPs. When your kube-controller-manager restarted, it may have missed reconciling the new Endpoints, especially with the simultaneous AMI-driven node replacement happening underneath it. I recommend checking the control plane logs for errors and filing a support case with AWS, since they can see the upgrade specifics on their side.
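Once control plane logging is enabled, you can search the controller-manager logs for endpoint-related errors from the upgrade window. A sketch, assuming the standard EKS log group naming, a cluster named `my-cluster`, and GNU `date`:

```shell
# Search the controller-manager log streams for endpoint-related messages
# in the last hour; log group follows the standard /aws/eks/<cluster>/cluster
# pattern, and "my-cluster" is a placeholder.
aws logs filter-log-events \
  --log-group-name /aws/eks/my-cluster/cluster \
  --log-stream-name-prefix kube-controller-manager \
  --filter-pattern "endpoint" \
  --start-time "$(date -d '1 hour ago' +%s000)"
```

Errors about failing to update or sync Endpoints around the restart would support the stale-Endpoints theory, and they're also exactly the kind of evidence AWS support will ask for.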

KubeMaster42 -

Yes, that makes sense. If the controller missed updates due to the restart, it definitely could lead to stale Endpoints causing those timeouts. Good luck with AWS support!
