Hey everyone, I'm reaching out because I'm dealing with some intermittent networking issues at the pod level in AWS EKS. I'm relatively new to AWS and EKS, so any help would be appreciated! Here's the setup:
- **Environment:** AWS GovCloud with a fully private EKS cluster (VPC with private endpoints and private hosted zones).
- **Cluster:** A vanilla EKS cluster with three add-ons (VPC CNI, CoreDNS, and kube-proxy) and a custom service CIDR range. The worker nodes pass the matching DNS cluster IP to the kubelet (sketched below).
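For context, the nodes are bootstrapped roughly like this, assuming an EKS-style bootstrap.sh; the cluster name, CIDR, IPs, and file path below are placeholders rather than my exact values:

```bash
# Rough sketch of the node user data, assuming an EKS-style bootstrap.sh
# (cluster name, endpoint, CIDR, and DNS IP are placeholders, not my real values).
/etc/eks/bootstrap.sh my-private-cluster \
  --apiserver-endpoint "$PRIVATE_API_ENDPOINT" \
  --b64-cluster-ca "$CLUSTER_CA" \
  --dns-cluster-ip 10.200.0.10   # must fall inside the custom service CIDR (e.g. 10.200.0.0/16)

# The value should land in the kubelet config as clusterDNS; the path may differ on Ubuntu images:
grep -A2 clusterDNS /etc/kubernetes/kubelet/kubelet-config.json
```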
**The Problem:**
I deployed a node group with three nodes, and everything worked fine at first: pods could communicate with each other and DNS queries resolved. However, the next day there was no network connectivity at the pod level, and DNS resolution was failing.
When I scaled the node group to six nodes, the three new nodes worked fine, but the original three still had DNS resolution and connectivity issues.
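In case it helps anyone reproduce this, a quick way I've been comparing nodes is to pin a throwaway test pod to a specific node; the node name and image here are just examples:

```bash
# Pin a throwaway pod to one of the problematic nodes (node name is an example)
# and run a lookup from inside it; any image with nslookup/dig will do.
kubectl run dns-test --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"nodeName":"ip-10-0-1-23.us-gov-west-1.compute.internal"}}' \
  -- sleep 3600

kubectl exec dns-test -- nslookup kubernetes.default.svc.cluster.local
kubectl delete pod dns-test
```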
I've checked that the CoreDNS, aws-node, and kube-proxy pods are all running without errors, and the kubelet logs look clean. I've verified that /etc/resolv.conf inside the pods points at the correct CoreDNS service IP. I even enabled CoreDNS query logging but didn't see anything helpful.
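For completeness, these are roughly the checks I ran; the label selectors are the EKS defaults and may need adjusting:

```bash
# Networking system pods and restart counts (label selectors are the EKS defaults)
kubectl -n kube-system get pods -o wide -l k8s-app=kube-dns     # CoreDNS
kubectl -n kube-system get pods -o wide -l k8s-app=aws-node     # VPC CNI
kubectl -n kube-system get pods -o wide -l k8s-app=kube-proxy

# Confirm a pod's resolv.conf points at the CoreDNS service IP
kubectl exec <some-pod> -- cat /etc/resolv.conf

# CoreDNS query logging: add the `log` plugin to the Corefile, then tail the logs
kubectl -n kube-system edit configmap coredns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
```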
I did notice what might be bandwidth/packet drops, but I'm not sure whether that's related. I checked CloudWatch for any dropped-connection metrics or logs but didn't find anything alarming.
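If it is a bandwidth or packet-rate issue, the ENA driver exposes per-instance allowance counters directly on the node; this is how I'm checking them (the interface name may differ depending on the AMI):

```bash
# On an affected node: non-zero "exceeded" counters mean the instance hit an ENA
# allowance (bandwidth, PPS, conntrack, or link-local traffic such as VPC DNS).
ethtool -S eth0 | grep -i exceeded
# e.g. bw_in_allowance_exceeded, pps_allowance_exceeded,
#      conntrack_allowance_exceeded, linklocal_allowance_exceeded
```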
The self-managed nodes run Ubuntu 22.04, and I'm wondering whether FIPS mode could be contributing to the issue. Any insights or troubleshooting steps would be super helpful! Thanks!
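For the FIPS angle, a quick way to confirm whether the kernel on a node is actually running in FIPS mode:

```bash
# 1 means the kernel is running in FIPS mode, 0 (or a missing file) means it is not
cat /proc/sys/crypto/fips_enabled
```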
1 Answer
I've run into similarly strange networking issues in the past. In some cases it turned out to be memory exhaustion on the nodes: when available memory gets too low, vital processes like the kubelet can crash without clear error messages, leaving nodes that appear Ready but are effectively unresponsive. I'd take a look at your instance type; I used to run t3.medium nodes and saw much better stability after moving to larger instances.
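A quick sketch of how to check for memory pressure on the suspect nodes (`kubectl top` assumes metrics-server is installed):

```bash
# Cluster-side checks
kubectl top nodes                     # requires metrics-server
kubectl describe node <node-name>     # look at the MemoryPressure condition and allocated resources

# On the node itself
free -m
sudo dmesg -T | grep -i "out of memory"
```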
Are you still seeing issues on the larger instances? Sometimes things look fine for a few hours before the problem comes back, so keep an eye on it!
That's an interesting thought! I hadn't considered memory pressure. Since I'm only running Nginx for testing, I figured t3.medium would be enough, but I'll move to some m5.xlarge instances and see if the problem persists. I also collected CoreDNS logs after scheduling it onto the problematic nodes, and it looks like it can't reach the VPC DNS resolver. I'm currently running tcpdumps to analyze the outgoing traffic and will update if there's any progress!
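For anyone following along, this is roughly what I'm capturing; the interface name and the CoreDNS IP are placeholders from my setup:

```bash
# On a problematic node: watch DNS traffic heading to the VPC resolver
# (the resolver is the .2 address of the VPC CIDR; interface name is an example)
sudo tcpdump -ni eth0 udp port 53

# From a pod on that node: query CoreDNS directly (IP is a placeholder for my
# custom service CIDR), then an external name, to see where resolution stops
dig @10.200.0.10 kubernetes.default.svc.cluster.local
dig @10.200.0.10 amazonaws.com
```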