I'm having a peculiar problem with my EKS cluster, where I use Karpenter to provision instances and KEDA to scale pods. My application has low traffic at times, so I want the nodes to scale down to zero. The downside is that my large images take a long time to pull whenever Karpenter provisions a new instance. To mitigate this, I built a golden image with the necessary container images baked in, so they don't have to be pulled at startup. The base for my image is the amazon-eks-node-al2023-x86_64-standard-1.33-v20251002 AMI. Unfortunately, whenever Karpenter creates a node from this custom AMI, the kube-proxy, aws-node, and pod-identity pods keep crashing repeatedly. Using the unmodified latest AMI works without any issues. My EC2NodeClass specifies the custom AMI along with other configuration, and I'm not seeing any error logs from the pods. What could I be overlooking?
3 Answers
Have you checked the logs directly on the node's file system? It can be tricky if the containers are constantly being recreated, but there might be some clues there.
I did SSH into the nodes and reviewed all the logs, including containerd and kubelet. The only thing I found was a restart signal with a 'PodSandBoxChanged' message.
Creating a custom AMI can add unnecessary complexity. Instead, you could start with an EKS-provided AMI, pull your images onto it, then snapshot that volume. This way, you can pass the snapshot ID into the node class while still using the standard EKS AMI. Check out the link for more details on this approach.
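For anyone who wants to try the snapshot route, here is a rough sketch of the node class fragment it implies. It assumes Karpenter's EC2NodeClass accepts a `snapshotID` under `blockDeviceMappings[].ebs`, as I understand the current API. The helper name, the snapshot ID, the device name, and the volume size are all placeholders. Which device holds the container image store depends on the AMI family, so check that before reusing it.

```python
import json


def snapshot_block_device(snapshot_id, device_name, volume_size="100Gi"):
    """Build one blockDeviceMappings entry that restores an EBS volume from a snapshot.

    Hypothetical helper: the field names follow Karpenter's EC2NodeClass as I
    understand it; the values passed in are placeholders chosen by the caller.
    """
    if not snapshot_id.startswith("snap-"):
        raise ValueError(f"not an EBS snapshot ID: {snapshot_id!r}")
    return {
        "deviceName": device_name,
        "ebs": {
            "snapshotID": snapshot_id,    # snapshot taken after pre-pulling the images
            "volumeSize": volume_size,    # assumed to be at least the snapshot's size
            "volumeType": "gp3",
            "deleteOnTermination": True,
        },
    }


if __name__ == "__main__":
    # Placeholder snapshot ID and device name, for illustration only.
    entry = snapshot_block_device("snap-0123456789abcdef0", "/dev/xvdb")
    patch = {"spec": {"blockDeviceMappings": [entry]}}
    print(json.dumps(patch, indent=2))
```

The printed JSON is shaped as a merge patch for the node class. Note that a merge patch replaces the whole `blockDeviceMappings` list, so any existing entries would need to be included as well.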
I saw that information too. I'm also considering switching to a Bottlerocket AMI for quicker startup times.
I actually resolved my issue by doing two things. First, I realized that the custom AMI I was using wasn't built from the same AMI as the one used by the launch template, so I updated it to match (both are for the same EKS version, just with different kernels). Second, I changed the IMDS hop limit to 2, since I found that might help pods access instance metadata. I think the second change was the key fix! Has anyone else had a similar experience?
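In case it helps someone else, the hop limit change looks roughly like this as a node class fragment. This is a sketch, assuming the setting lives under `spec.metadataOptions.httpPutResponseHopLimit` in Karpenter's EC2NodeClass. The helper name and the other option values are illustrative, not copied from my actual config.

```python
import json


def metadata_options(hop_limit=2):
    """Build a metadataOptions fragment with the given IMDSv2 hop limit.

    Hypothetical helper; EC2 accepts hop limits from 1 to 64. A limit of 1
    keeps the IMDSv2 token response from travelling past the host itself,
    so pods on their own network namespace need at least 2.
    """
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return {
        "httpEndpoint": "enabled",
        "httpPutResponseHopLimit": hop_limit,
        "httpTokens": "required",  # IMDSv2 only; an assumption, adjust as needed
    }


if __name__ == "__main__":
    patch = {"spec": {"metadataOptions": metadata_options(2)}}
    print(json.dumps(patch, indent=2))
```

The output can be merged into the existing EC2NodeClass. Nodes that are already running keep their old metadata options until they are replaced.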
I wouldn't expect the hop limit to matter here, since IMDS hop limit issues typically affect pods that are not on the host network. The VPC CNI should work regardless, because it runs on the host network.

For the VPC CNI, remember that the aws-node container won't have the logs you'd expect. Instead, look under /var/log/aws-routed-eni/ on the node.
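If the node has Python available, a small script like the one below can dump the tail of each log in that directory in one go. It is only a convenience sketch: the function name is made up, it assumes the files end in `.log`, and it will likely need to run as root to read them.

```python
from collections import deque
from pathlib import Path


def tail_logs(log_dir="/var/log/aws-routed-eni", lines=20):
    """Return {file name: last `lines` lines} for every *.log file in log_dir.

    Hypothetical helper; a missing or empty directory yields an empty dict.
    """
    tails = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            # deque with maxlen keeps only the final `lines` lines in memory.
            tails[path.name] = [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
    return tails


if __name__ == "__main__":
    for name, tail in tail_logs().items():
        print(f"==> {name} <==")
        print("\n".join(tail))
```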