I'm having a really odd problem with my EKS cluster. Every day, around the same time, a set of nodes goes into a NotReady state. I've scoured through all the monitoring tools I have, checking control plane logs, EC2 host metrics, CoreDNS, cron jobs, and node logs, but I can't find any spikes or anomalies that could explain why they're getting into this state.
On these affected nodes, the kubelet seems to lose its connection to the API server, showing a timeout error, but it recovers quickly. Despite this being a daily occurrence, I haven't pinpointed the root cause yet. I've even consulted support, but haven't gotten any definite answers. There aren't any apparent resource pressures or network issues that might be triggering this. Has anyone experienced similar issues or have suggestions on what I could investigate further?
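One thing that might help before digging further: pull the actual NotReady transition timestamps out of node events and bucket them by hour, to confirm exactly when and how tightly clustered the flaps are. A minimal sketch (assuming the events haven't aged out of your cluster's event TTL; `bucket_by_hour` is just a hypothetical helper name):

```shell
# Hypothetical helper: confirm the daily pattern by bucketing NotReady
# transition timestamps by hour. Feed it lines like
# "2024-05-01T03:12:07Z ip-10-0-1-23.ec2.internal", e.g. from:
#   kubectl get events -A --field-selector reason=NodeNotReady \
#     -o custom-columns=TIME:.lastTimestamp,NODE:.involvedObject.name --no-headers
bucket_by_hour() {
  # Take the timestamp in column 1, keep only the hour, then count per hour.
  awk '{ split($1, t, "T"); printf "%s:00\n", substr(t[2], 1, 2) }' | sort | uniq -c
}
```

If the counts pile up in a single hour bucket, that narrows the search to whatever runs in that window (cron, backups, certificate rotation, and so on).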
6 Answers
Are these NotReady nodes running on spot instances? Check whether spot interruption notices are hitting them around that time; that could also produce these symptoms.
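If these are EKS-managed node groups, the capacity-type node label makes this quick to check; a sketch assuming the standard `eks.amazonaws.com/capacityType` label (`spot_nodes` is just a hypothetical helper):

```shell
# Filter the output of:
#   kubectl get nodes -L eks.amazonaws.com/capacityType
# down to spot nodes (the label value lands in the last column).
spot_nodes() {
  awk 'NR > 1 && $NF == "SPOT" { print $1 }'
}
```

On the instance itself, the IMDS path `/latest/meta-data/spot/instance-action` is populated shortly before an interruption, so polling that around the window is another way to rule spot reclaims in or out.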
I've dealt with similar issues when older versions of the CNI or kube-proxy were in use, or when workloads exhausted memory on the nodes. Either can cause the kubelet to lose connectivity to the API server temporarily.
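For the version check, the daemonset image tags are the quickest thing to compare against what AWS recommends for your cluster version. A small sketch; `image_tag` is a hypothetical helper that just takes whatever follows the last colon:

```shell
# Extract the version tag from a container image reference, e.g. one printed by:
#   kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'
#   kubectl -n kube-system get ds kube-proxy -o jsonpath='{.spec.template.spec.containers[0].image}'
image_tag() {
  # Strip everything up to and including the last ':'.
  echo "${1##*:}"
}
```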
So, is this happening at exactly the same time each day for a specific duration? Also, are these nodes all from the same node group? If there's something in common between them, that might reveal a pattern.
Have you checked whether anything could be overloading the API server? It might be worth taking a look at the control plane API server logs or reaching out to AWS support to monitor the control plane at this specific time.
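If you can scrape the API server's metrics endpoint (`kubectl get --raw /metrics`), summing `apiserver_request_total` by verb before and during the window can show whether the uptick is reads, writes, or watches. A rough sketch over Prometheus-format output (`requests_by_verb` is a hypothetical helper):

```shell
# Sum apiserver_request_total samples by verb from a Prometheus-format dump,
# e.g. piped in from `kubectl get --raw /metrics`.
requests_by_verb() {
  awk -F'[,=}" ]+' '/^apiserver_request_total/ {
    # Walk the tokens to find the value following the "verb" label name.
    for (i = 1; i <= NF; i++) if ($i == "verb") verb = $(i + 1);
    sum[verb] += $NF
  } END { for (v in sum) printf "%s %d\n", v, sum[v] }' | sort
}
```

Comparing two snapshots taken an hour apart around the flap would show which verb accounts for the request spike you saw.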
I looked into the control plane logs, but didn't find anything conclusive. Recently, AWS enabled control plane monitoring, and I did notice an uptick in API requests, but it seems more like a symptom than a cause.
Any chance there are backup jobs running around that time? Or could AWS be doing some updates or checks? While it shouldn't normally cause this, it's worth considering along with opening a support ticket with AWS.
Try installing ethtool and checking the ENA driver's statistics for dropped or queued packets. If the allowance counters are climbing, the instance is exceeding its network allowances (bandwidth, packets per second, or connection tracking) and you might need to move to a higher-bandwidth instance type. You can check with a command like `ethtool -S ens5 | grep exceeded`.
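To avoid eyeballing the raw dump, you can filter for just the nonzero allowance counters; a small sketch over `ethtool -S` output (`exceeded_counters` is a hypothetical helper name):

```shell
# Flag any nonzero ENA allowance counters from `ethtool -S ens5` output.
# Names like bw_in_allowance_exceeded or pps_allowance_exceeded mean the
# instance hit a network allowance and packets were queued or dropped.
exceeded_counters() {
  awk -F': ' '/allowance_exceeded/ && $2 + 0 > 0 { gsub(/^ +/, "", $1); print $1, $2 }'
}
```

Running it once before the daily window and once after would tell you whether the counters move at the same time the nodes flap.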
Yes, it happens at pretty much the same time daily, although the timing shifts a bit every few weeks. These nodes don't seem to have anything in common configuration-wise.