Issues with Karpenter and Custom AMI on EKS

Asked By TechSavvy42 On

I'm having a peculiar problem with my EKS cluster, where I use Karpenter to provision instances and KEDA to scale pods. My application has low traffic at times, so I want the nodes to scale down to zero. However, my images are large, and pulling them takes a long time whenever Karpenter provisions a new instance. To mitigate this, I created a golden image with the necessary container images baked in so that nothing needs to be pulled at startup. The base for my image is the amazon-eks-node-al2023-x86_64-standard-1.33-v20251002 AMI. Unfortunately, whenever Karpenter creates a node from this custom AMI, the kube-proxy, aws-node, and pod-identity pods crash repeatedly. Using the unmodified latest AMI works without any issues. My EC2NodeClass specifies the custom AMI along with my other configuration, and I'm not seeing any error logs from the pods. What could I be overlooking?

3 Answers

Answered By NodeWhisperer08 On

Have you checked the logs directly on the node's file system? It can be tricky if the containers are constantly being recreated, but there might be some clues there.

QuickCheck007 -

For the VPC CNI, remember that the aws-node container won't have the logs you'd expect. Instead, look under /var/log/aws-routed-eni/ on the node itself.
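If it helps, here is a rough Python sketch for skimming that directory on the node rather than reading every file by hand. The keyword list and the `find_suspect_lines` helper are my own assumptions, not part of the VPC CNI; it simply scans any `*.log` files it finds (typically `ipamd.log` and `plugin.log`) for lines that look like failures.

```python
from pathlib import Path

# Assumption: these substrings are only a guess at what is worth flagging.
KEYWORDS = ("error", "failed", "timeout")


def find_suspect_lines(log_dir="/var/log/aws-routed-eni", keywords=KEYWORDS, limit=20):
    """Return up to `limit` (file name, line) pairs whose line contains a keyword."""
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file (permissions, rotation); skip it
        for line in text.splitlines():
            if any(k in line.lower() for k in keywords):
                hits.append((path.name, line.strip()))
                if len(hits) >= limit:
                    return hits
    return hits


if __name__ == "__main__":
    for name, line in find_suspect_lines():
        print(f"{name}: {line}")
```

Run it as root on the affected node, since the log files are usually not world-readable.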

PodDigger99 -

I did SSH into the nodes and reviewed all the logs, including those from containerd and kubelet. The only thing I found was a restart signal with a 'PodSandBoxChanged' message.

Answered By CloudGuru21 On

Creating a custom AMI can add unnecessary complexity. Instead, you could start with an EKS-provided AMI, pull your images onto it, then snapshot that volume. This way, you can pass the snapshot ID into the node class while still using the standard EKS AMI. Check out the link for more details on this approach.
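As a rough illustration of the snapshot approach, here is a Python sketch that builds an EC2NodeClass manifest with a snapshot attached through `blockDeviceMappings`. Treat everything in it as assumptions: the `karpenter.k8s.aws/v1` API version, the `karpenter.sh/discovery` tag used for subnet and security group discovery, and all the names and IDs in the usage example are placeholders. Which device to map depends on the AMI family (on Bottlerocket, container images live on the data volume, usually `/dev/xvdb`), so check that against your own AMI before using it.

```python
import json


def node_class_with_snapshot(name, role, cluster_tag, ami_alias,
                             device_name, snapshot_id, volume_size="100Gi"):
    """Build an EC2NodeClass manifest that restores one volume from a snapshot.

    All arguments are caller-supplied; nothing here is looked up from AWS.
    """
    discovery = {"tags": {"karpenter.sh/discovery": cluster_tag}}
    return {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            # Stock EKS-provided AMI; only the volume contents are custom.
            "amiSelectorTerms": [{"alias": ami_alias}],
            "role": role,
            "subnetSelectorTerms": [discovery],
            "securityGroupSelectorTerms": [discovery],
            "blockDeviceMappings": [
                {
                    "deviceName": device_name,
                    "ebs": {
                        "volumeSize": volume_size,
                        "volumeType": "gp3",
                        "snapshotID": snapshot_id,
                        "deleteOnTermination": True,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    manifest = node_class_with_snapshot(
        name="prewarmed",
        role="KarpenterNodeRole-my-cluster",
        cluster_tag="my-cluster",
        ami_alias="bottlerocket@latest",
        device_name="/dev/xvdb",
        snapshot_id="snap-0123456789abcdef0",
    )
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be saved to a file and applied directly. In a real cluster you would probably pin the AMI alias to a specific version instead of `@latest`.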

ImageSaver44 -

I saw that information too. I'm also considering switching to a Bottlerocket AMI for quicker startup times.

Answered By FixItFellow On

I resolved my issue by changing two things. First, I realized that the custom AMI I was using didn't match the one used by the launch template, so I updated it to match (both are for the same EKS version, just with different kernels). Second, I raised the IMDS hop limit to 2, since I had read that this can help pods reach instance metadata. I think the second change was the key fix! Has anyone else had a similar experience?
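For anyone wanting to try the hop-limit change through Karpenter, a minimal Python sketch is below. It assumes the setting lives under `spec.metadataOptions.httpPutResponseHopLimit` in the EC2NodeClass, and `with_hop_limit` is a hypothetical helper that only edits a manifest dict; it does not call AWS or the cluster.

```python
import copy


def with_hop_limit(node_class, hop_limit=2):
    """Return a copy of an EC2NodeClass manifest with the IMDS hop limit set."""
    if not 1 <= hop_limit <= 64:
        raise ValueError("EC2 accepts hop limits from 1 to 64")
    updated = copy.deepcopy(node_class)  # leave the caller's dict untouched
    options = updated.setdefault("spec", {}).setdefault("metadataOptions", {})
    options["httpPutResponseHopLimit"] = hop_limit
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed manifest for illustration only.
    base = {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": "custom-ami"},
        "spec": {"metadataOptions": {"httpTokens": "required"}},
    }
    print(with_hop_limit(base)["spec"]["metadataOptions"])
```

Existing options such as `httpTokens` are preserved; only the hop limit is added or overwritten.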

HopLimitExpert -

I wouldn't expect the hop limit to matter here, since IMDS hop limit issues typically affect only pods that are not on the host network. The VPC CNI should work regardless, because it runs on the host network.
