I'm having a really odd problem with my EKS cluster. Every day, around the same time, a set of nodes goes into a NotReady state. I've scoured through all the monitoring tools I have, checking control plane logs, EC2 host metrics, CoreDNS, cron jobs, and node logs, but I can't find any spikes or anomalies that could explain why they're getting into this state.
On these affected nodes, the kubelet seems to lose its connection to the API server, showing a timeout error, but it recovers quickly. Despite this being a daily occurrence, I haven't pinpointed the root cause yet. I've even consulted support, but haven't gotten any definite answers. There aren't any apparent resource pressures or network issues that might be triggering this. Has anyone experienced similar issues or have suggestions on what I could investigate further?
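One thing that might help before digging further: pull the actual NotReady transition timestamps out of node events and bucket them by hour, to confirm exactly when and how tightly clustered the flaps are. A minimal sketch (assuming the events haven't aged out of your cluster's event TTL; `bucket_by_hour` is just a hypothetical helper name):

```shell
# Hypothetical helper: confirm the daily pattern by bucketing NotReady
# transition timestamps by hour. Feed it lines like
# "2024-05-01T03:12:07Z ip-10-0-1-23.ec2.internal", e.g. from:
#   kubectl get events -A --field-selector reason=NodeNotReady \
#     -o custom-columns=TIME:.lastTimestamp,NODE:.involvedObject.name --no-headers
bucket_by_hour() {
  # Take the timestamp in column 1, keep only the hour, then count per hour.
  awk '{ split($1, t, "T"); printf "%s:00\n", substr(t[2], 1, 2) }' | sort | uniq -c
}
```

If the counts pile up in a single hour bucket, that narrows the search to whatever runs in that window (cron, backups, certificate rotation, and so on).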
6 Answers
Are these NotReady nodes running on spot instances? Check whether spot interruption notices are hitting them around that time; that could also produce these symptoms.
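If these are EKS-managed node groups, the capacity-type node label makes this quick to check; a sketch assuming the standard `eks.amazonaws.com/capacityType` label (`spot_nodes` is just a hypothetical helper):

```shell
# Filter the output of:
#   kubectl get nodes -L eks.amazonaws.com/capacityType
# down to spot nodes (the label value lands in the last column).
spot_nodes() {
  awk 'NR > 1 && $NF == "SPOT" { print $1 }'
}
```

On the instance itself, the IMDS path `/latest/meta-data/spot/instance-action` is populated shortly before an interruption, so polling that around the window is another way to rule spot reclaims in or out.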
I've dealt with similar issues when older versions of the CNI or kube-proxy were in use, or when workloads exhausted memory on the nodes. Either can cause the kubelet to lose connectivity to the API server temporarily.
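For the version check, the daemonset image tags are the quickest thing to compare against what AWS recommends for your cluster version. A small sketch; `image_tag` is a hypothetical helper that just takes whatever follows the last colon:

```shell
# Extract the version tag from a container image reference, e.g. one printed by:
#   kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'
#   kubectl -n kube-system get ds kube-proxy -o jsonpath='{.spec.template.spec.containers[0].image}'
image_tag() {
  # Strip everything up to and including the last ':'.
  echo "${1##*:}"
}
```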
So, is this happening at exactly the same time each day for a specific duration? Also, are these nodes all from the same node group? If there's something in common between them, that might reveal a pattern.
Have you checked whether anything could be overloading the API server? It might be worth taking a look at the control plane API server logs or reaching out to AWS support to monitor the control plane at this specific time.
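If you can scrape the API server's metrics endpoint (`kubectl get --raw /metrics`), summing `apiserver_request_total` by verb before and during the window can show whether the uptick is reads, writes, or watches. A rough sketch over Prometheus-format output (`requests_by_verb` is a hypothetical helper):

```shell
# Sum apiserver_request_total samples by verb from a Prometheus-format dump,
# e.g. piped in from `kubectl get --raw /metrics`.
requests_by_verb() {
  awk -F'[,=}" ]+' '/^apiserver_request_total/ {
    # Walk the tokens to find the value following the "verb" label name.
    for (i = 1; i <= NF; i++) if ($i == "verb") verb = $(i + 1);
    sum[verb] += $NF
  } END { for (v in sum) printf "%s %d\n", v, sum[v] }' | sort
}
```

Comparing two snapshots taken an hour apart around the flap would show which verb accounts for the request spike you saw.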
I looked into the control plane logs, but didn't find anything conclusive. Recently, AWS enabled control plane monitoring, and I did notice an uptick in API requests, but it seems more like a symptom than a cause.
Any chance there are backup jobs running around that time? Or could AWS be doing some updates or checks? While it shouldn't normally cause this, it's worth considering along with opening a support ticket with AWS.
Try installing ethtool and checking the ENA driver's statistics for dropped or queued packets. If the allowance counters are climbing, the instance is exceeding its network allowances (bandwidth, packets per second, or connection tracking) and you might need to move to a higher-bandwidth instance type. You can check with a command like `ethtool -S ens5 | grep exceeded`.
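To avoid eyeballing the raw dump, you can filter for just the nonzero allowance counters; a small sketch over `ethtool -S` output (`exceeded_counters` is a hypothetical helper name):

```shell
# Flag any nonzero ENA allowance counters from `ethtool -S ens5` output.
# Names like bw_in_allowance_exceeded or pps_allowance_exceeded mean the
# instance hit a network allowance and packets were queued or dropped.
exceeded_counters() {
  awk -F': ' '/allowance_exceeded/ && $2 + 0 > 0 { gsub(/^ +/, "", $1); print $1, $2 }'
}
```

Running it once before the daily window and once after would tell you whether the counters move at the same time the nodes flap.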
Yes, it happens at pretty much the same time daily, although the timing shifts a bit every few weeks. These nodes don't seem to have anything in common configuration-wise.