EKS Nodes Going NotReady at the Same Time Daily – Help!

Asked By CuriousCat123 On

I'm having a really odd problem with my EKS cluster. Every day, around the same time, a set of nodes goes NotReady. I've scoured every monitoring tool I have, checking control plane logs, EC2 host metrics, CoreDNS, cron jobs, and node logs, but I can't find any spike or anomaly that would explain why they end up in that state.

On these affected nodes, the kubelet seems to lose its connection to the API server, showing a timeout error, but it recovers quickly. Despite this being a daily occurrence, I haven't pinpointed the root cause yet. I've even consulted support, but haven't gotten any definite answers. There aren't any apparent resource pressures or network issues that might be triggering this. Has anyone experienced similar issues or have suggestions on what I could investigate further?

6 Answers

Answered By SpotlessHunter On

Are the NotReady nodes running on Spot Instances? Check whether Spot interruption or rebalance notices are hitting them around that time; that could also lead to exactly this kind of issue.
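
As a rough sketch (assuming IMDSv2 is enabled and you can get a shell on the node via SSM or SSH), you can ask the instance metadata service directly whether a notice is pending:

```bash
# Get an IMDSv2 session token first.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

# A 200 response with a JSON body means a Spot interruption is scheduled;
# a 404 means nothing is pending.
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

# Rebalance recommendations live under a separate path.
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/events/recommendations/rebalance
```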

Answered By MemoryMaven On

I've dealt with similar issues when older versions of the VPC CNI or kube-proxy were in use, or when workloads exhausted the memory on a node. Either can cause the kubelet to lose connectivity temporarily.
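
If it helps, a quick sketch (assuming the default aws-node and kube-proxy DaemonSets in kube-system, and metrics-server installed for `kubectl top`; `<node-name>` is a placeholder) to check both:

```bash
# Which VPC CNI and kube-proxy images are the nodes actually running?
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n kube-system get daemonset kube-proxy \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Any MemoryPressure (or other) conditions on an affected node?
kubectl describe node <node-name> | grep -A 10 'Conditions:'

# Current node-level usage (needs metrics-server).
kubectl top nodes
```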

Answered By NodeMaster99 On

So, is this happening at exactly the same time each day for a specific duration? Also, are these nodes all from the same node group? If there's something in common between them, that might reveal a pattern.
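
One quick way to eyeball what they have in common (assuming EKS managed node groups, which get the `eks.amazonaws.com/nodegroup` label automatically):

```bash
# Show every node with its node group, instance type, and AZ as extra columns.
kubectl get nodes \
  -L eks.amazonaws.com/nodegroup \
  -L node.kubernetes.io/instance-type \
  -L topology.kubernetes.io/zone
```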

CuriousCat123 -

Yes, it happens at pretty much the same time daily, although the timing shifts a bit every few weeks. These nodes don't seem to have anything in common configuration-wise.

Answered By TechWhiz22 On

Have you checked whether anything could be overloading the API server? It might be worth taking a look at the control plane API server logs or reaching out to AWS support to monitor the control plane at this specific time.
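
As a rough starting point (assuming you have permission to read the API server's `/metrics` endpoint), you can sample the API server's own request counters during the window and see which verbs and resources dominate:

```bash
# Dump cumulative request counts by verb/resource/code and show the largest ones.
kubectl get --raw /metrics \
  | grep '^apiserver_request_total' \
  | sort -k2 -rn \
  | head -20

# In-flight request gauges are a quick overload indicator.
kubectl get --raw /metrics | grep '^apiserver_current_inflight_requests'
```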

CuriousCat123 -

I looked into the control plane logs, but didn't find anything conclusive. Recently, AWS enabled control plane monitoring, and I did notice an uptick in API requests, but it seems more like a symptom than a cause.

Answered By BackupGuru42 On

Any chance there are backup jobs running around that time? Or could AWS be doing updates or checks on its side? Neither should normally cause this, but it's worth ruling out, along with opening a support ticket with AWS.
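
Quick ways to rule out in-cluster scheduled work (standard kubectl, nothing cluster-specific assumed):

```bash
# Every CronJob in the cluster, with its schedule and last run.
kubectl get cronjobs -A -o wide

# Recently created Jobs, newest last, to see what actually fired around the window.
kubectl get jobs -A --sort-by=.metadata.creationTimestamp
```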

Answered By PacketNinja88 On

Try installing ethtool to investigate dropped or shaped packets. A command like `ethtool -S ens5 | grep exceeded` shows the relevant counters; if they're incrementing around the time of the incidents, you may be hitting the instance's network allowances and might need to move to a higher-bandwidth instance type.
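
For reference, a sketch of what to look for (assuming an ENA-based instance where the primary interface is ens5; the interface name can differ by instance type):

```bash
# The ENA driver keeps per-interface counters for traffic that exceeded the
# instance's network allowances.
ethtool -S ens5 | grep exceeded

# Counters worth watching:
#   bw_in_allowance_exceeded / bw_out_allowance_exceeded  -> bandwidth limits
#   pps_allowance_exceeded                                 -> packets-per-second limit
#   conntrack_allowance_exceeded                           -> connection-tracking limit
#   linklocal_allowance_exceeded                           -> link-local traffic (DNS, IMDS, NTP)
# Counters that climb at the same time as the NotReady events suggest the node
# itself is being network-shaped rather than the API server being the problem.
```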
