My EC2 instances are failing to join my Kubernetes cluster on EKS, and node group creation never succeeds. I keep getting the "Instances failed to join the kubernetes cluster" error and I'm stuck. Here's the error message I received:
Error: waiting for EKS Node Group (my-eks-cluster:my-node-group) create: unexpected state 'CREATE_FAILED'. Last error: NodeCreationFailure: Instances failed to join the kubernetes cluster.
I've set up my Terraform code as follows:
provider "aws" {
  region = "eu-central-1"
}
# VPC module and everything else here...
Can anyone suggest what I might need to fix to resolve this?
4 Answers
Check your security group settings. Either remove the restrictive group or allow inbound TCP traffic on port 10250 to your nodes. The cluster API needs to communicate with the kubelets on your nodes, and that currently seems to be blocked. Also consider attaching the AmazonEKSClusterPolicy to your cluster role so it has the proper permissions. The AWS provider docs have some handy examples as well.
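For illustration, here is a minimal sketch of what those two changes could look like in Terraform. The three variables are hypothetical placeholders (nothing in the question shows how the security groups or the cluster role are defined), so swap them for references to your own resources:

```hcl
# Hypothetical inputs: replace with references to your own resources.
variable "cluster_security_group_id" {
  type        = string
  description = "Security group used by the EKS control plane"
}

variable "node_security_group_id" {
  type        = string
  description = "Security group attached to the worker nodes"
}

variable "cluster_role_name" {
  type        = string
  description = "Name of the IAM role assumed by the EKS cluster"
}

# Allow the control plane to reach the kubelet on each node.
resource "aws_security_group_rule" "control_plane_to_kubelet" {
  type                     = "ingress"
  description              = "Control plane to kubelet"
  from_port                = 10250
  to_port                  = 10250
  protocol                 = "tcp"
  security_group_id        = var.node_security_group_id
  source_security_group_id = var.cluster_security_group_id
}

# Attach the AWS managed cluster policy to the cluster role.
resource "aws_iam_role_policy_attachment" "cluster_policy" {
  role       = var.cluster_role_name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}
```

If you manage rules inline inside `aws_security_group` blocks, add the port there instead; mixing inline rules and standalone `aws_security_group_rule` resources on the same group tends to cause conflicts.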
Inspect the logs on the failing nodes. Start with the cloud-init logs (typically `/var/log/cloud-init-output.log`); they can point to networking or permission issues. Also make sure a CNI plugin is installed. Missing network components can produce the same error you're seeing.
This sounds like a networking issue. Double-check that your route tables are configured correctly. The nodes need to reach the cluster API endpoint, so verify that nothing is blocking that traffic.
Did you remember to tag your subnets? Every subnet that hosts nodes should have the tag `kubernetes.io/cluster/myclustername: shared`. If you're using the public VPC module, it might be simpler to use the EKS module as well, just for consistency. Another possibility is that the user data script isn't executing properly during node setup.
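As a rough sketch, the tag could be added to existing subnets like this. The variable names are made up for the example, and it assumes the subnets are not already tagged through a VPC module (if they are, set the tag there instead so the two don't fight over it):

```hcl
# Hypothetical inputs: your cluster name and the subnets that host nodes.
variable "cluster_name" {
  type = string
}

variable "node_subnet_ids" {
  type = list(string)
}

# Adds the kubernetes.io/cluster/<name> = shared tag to each node subnet.
resource "aws_ec2_tag" "cluster_shared" {
  for_each    = toset(var.node_subnet_ids)
  resource_id = each.value
  key         = "kubernetes.io/cluster/${var.cluster_name}"
  value       = "shared"
}
```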
Yeah, the user data could be the issue. The bootstrap script used to be the standard approach, but it might need updating: newer EKS AMIs (Amazon Linux 2023) use nodeadm, which takes a YAML configuration instead.
I had a similar issue before: my nodes couldn't reach the Internet, which caused them to fail to join. Definitely check that!
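If the nodes sit in private subnets, one common way to give them outbound access is a NAT gateway. The sketch below is only an assumption about the setup (the question doesn't show the VPC code); the variables are hypothetical, and the public subnet must already route to an internet gateway:

```hcl
# Hypothetical inputs: an existing public subnet and the route table
# used by the private subnets that host the nodes.
variable "public_subnet_id" {
  type = string
}

variable "private_route_table_id" {
  type = string
}

# Elastic IP for the NAT gateway.
# (AWS provider v5+; older provider versions use `vpc = true` instead.)
resource "aws_eip" "nat" {
  domain = "vpc"
}

# The NAT gateway lives in the public subnet.
resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = var.public_subnet_id
}

# Default route for the private subnets goes through the NAT gateway.
resource "aws_route" "private_default" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id
}
```

If the private route table already has a `0.0.0.0/0` route, this one will conflict with it, so check the existing routes first.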