Hey everyone, I'm reaching out because I'm dealing with some intermittent networking issues at the pod level in AWS EKS. I'm relatively new to AWS and EKS, so any help would be appreciated! Here's the setup:
- **Environment:** AWS GovCloud with a fully private EKS cluster (VPC with private endpoints and private hosted zones).
- **Cluster:** A vanilla EKS cluster with three add-ons (VPC CNI, CoreDNS, and kube-proxy) and a custom service CIDR range. The worker nodes pass the matching DNS cluster IP to the kubelet (sketched below).
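For context, the nodes are bootstrapped roughly like this, assuming an EKS-style bootstrap.sh; the cluster name, CIDR, IPs, and file path below are placeholders rather than my exact values:

```bash
# Rough sketch of the node user data, assuming an EKS-style bootstrap.sh
# (cluster name, endpoint, CIDR, and DNS IP are placeholders, not my real values).
/etc/eks/bootstrap.sh my-private-cluster \
  --apiserver-endpoint "$PRIVATE_API_ENDPOINT" \
  --b64-cluster-ca "$CLUSTER_CA" \
  --dns-cluster-ip 10.200.0.10   # must fall inside the custom service CIDR (e.g. 10.200.0.0/16)

# The value should land in the kubelet config as clusterDNS; the path may differ on Ubuntu images:
grep -A2 clusterDNS /etc/kubernetes/kubelet/kubelet-config.json
```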
**The Problem:**
I deployed a node group with three nodes, and everything worked fine at first: pods could communicate with each other and DNS queries resolved. However, the next day there was no network connectivity at the pod level, and DNS resolution was failing.
When I scaled the node group to six nodes, the three new nodes worked fine, but the original three still had DNS resolution and connectivity issues.
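In case it helps anyone reproduce this, a quick way I've been comparing nodes is to pin a throwaway test pod to a specific node; the node name and image here are just examples:

```bash
# Pin a throwaway pod to one of the problematic nodes (node name is an example)
# and run a lookup from inside it; any image with nslookup/dig will do.
kubectl run dns-test --image=busybox:1.36 --restart=Never \
  --overrides='{"spec":{"nodeName":"ip-10-0-1-23.us-gov-west-1.compute.internal"}}' \
  -- sleep 3600

kubectl exec dns-test -- nslookup kubernetes.default.svc.cluster.local
kubectl delete pod dns-test
```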
I've checked that the CoreDNS, aws-node, and kube-proxy pods are all running without errors, and the kubelet logs look clean. I've verified that /etc/resolv.conf inside the pods points at the correct CoreDNS service IP. I even enabled CoreDNS query logging but didn't see anything helpful.
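For completeness, these are roughly the checks I ran; the label selectors are the EKS defaults and may need adjusting:

```bash
# Networking system pods and restart counts (label selectors are the EKS defaults)
kubectl -n kube-system get pods -o wide -l k8s-app=kube-dns     # CoreDNS
kubectl -n kube-system get pods -o wide -l k8s-app=aws-node     # VPC CNI
kubectl -n kube-system get pods -o wide -l k8s-app=kube-proxy

# Confirm a pod's resolv.conf points at the CoreDNS service IP
kubectl exec <some-pod> -- cat /etc/resolv.conf

# CoreDNS query logging: add the `log` plugin to the Corefile, then tail the logs
kubectl -n kube-system edit configmap coredns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100
```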
I did notice what might be bandwidth/packet drops, but I'm not sure whether that's related. I checked CloudWatch for any dropped-connection metrics or logs but didn't find anything alarming.
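If it is a bandwidth or packet-rate issue, the ENA driver exposes per-instance allowance counters directly on the node; this is how I'm checking them (the interface name may differ depending on the AMI):

```bash
# On an affected node: non-zero "exceeded" counters mean the instance hit an ENA
# allowance (bandwidth, PPS, conntrack, or link-local traffic such as VPC DNS).
ethtool -S eth0 | grep -i exceeded
# e.g. bw_in_allowance_exceeded, pps_allowance_exceeded,
#      conntrack_allowance_exceeded, linklocal_allowance_exceeded
```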
The self-managed nodes run Ubuntu 22.04, and I'm wondering whether FIPS mode could be contributing to the issue. Any insights or troubleshooting steps would be super helpful! Thanks!
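For the FIPS angle, a quick way to confirm whether the kernel on a node is actually running in FIPS mode:

```bash
# 1 means the kernel is running in FIPS mode, 0 (or a missing file) means it is not
cat /proc/sys/crypto/fips_enabled
```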
1 Answer
I've run into similarly strange networking issues in the past. In some cases it turned out to be memory exhaustion on the nodes: when available memory gets too low, vital processes like the kubelet can crash without clear error messages, leaving nodes that appear Ready but are effectively unresponsive. I'd take a look at your instance type; I used to run t3.medium nodes and saw much better stability after moving to larger instances.
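A quick sketch of how to check for memory pressure on the suspect nodes (`kubectl top` assumes metrics-server is installed):

```bash
# Cluster-side checks
kubectl top nodes                     # requires metrics-server
kubectl describe node <node-name>     # look at the MemoryPressure condition and allocated resources

# On the node itself
free -m
sudo dmesg -T | grep -i "out of memory"
```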
Are you still seeing issues on the larger instances? Sometimes things look fine for a few hours before the problem comes back, so keep an eye on it!
That's an interesting thought! I hadn't considered memory pressure. Since I'm only running Nginx for testing, I figured t3.medium would be enough, but I'll move to some m5.xlarge instances and see if the problem persists. I also collected CoreDNS logs after scheduling it onto the problematic nodes, and it looks like it can't reach the VPC DNS resolver. I'm currently running tcpdumps to analyze the outgoing traffic and will update if there's any progress!
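For anyone following along, this is roughly what I'm capturing; the interface name and the CoreDNS IP are placeholders from my setup:

```bash
# On a problematic node: watch DNS traffic heading to the VPC resolver
# (the resolver is the .2 address of the VPC CIDR; interface name is an example)
sudo tcpdump -ni eth0 udp port 53

# From a pod on that node: query CoreDNS directly (IP is a placeholder for my
# custom service CIDR), then an external name, to see where resolution stops
dig @10.200.0.10 kubernetes.default.svc.cluster.local
dig @10.200.0.10 amazonaws.com
```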