I've been dealing with a persistent issue in my local kubeadm cluster where the kube-proxy pods are stuck in CrashLoopBackOff on every worker node. For context: the cluster has 4 nodes (one control-plane node and three workers, 128 CPUs each), with containerd as the runtime and Flannel as the CNI. The kube-proxy pods start and sync their caches, then crash consistently after about 1 minute 20 seconds with exit code 2. A hard reset hasn't resolved it either.
Currently, the Flannel CNI pods are stable and all control plane components appear healthy. The kube-proxy logs show a clean initialization; the only warning is about `nodePortAddresses` being unset, which as I understand it is harmless.
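For reference, here's roughly how I've been inspecting the crashing pods (standard kubectl commands; the pod name below is a placeholder):

```
# List the kube-proxy pods and their restart counts
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide

# Exit code and recent events for one of them (pod name is an example)
kubectl -n kube-system describe pod kube-proxy-xxxxx

# Logs from the crashed container, not the current attempt
kubectl -n kube-system logs kube-proxy-xxxxx --previous
```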
I'm really looking for insights from anyone who's seen a similar crash pattern. Has anyone had kube-proxy start cleanly but crash shortly afterward? More importantly, what could produce that exit code 2 after an apparently successful startup? Any troubleshooting tips or similar experiences would be much appreciated!
1 Answer
You might be facing a cgroup driver mismatch between kubelet and containerd. If kubelet is on the default `cgroupfs` driver while containerd uses `systemd` (or the other way around), you end up with two cgroup managers on the same node, and the resulting resource-accounting instability typically shows up as pods that start fine and then get killed and restarted, much like what you're seeing with kube-proxy.
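A quick way to check both sides, assuming the stock file locations from a kubeadm/containerd install:

```
# Driver kubelet is configured with (kubeadm writes this file)
grep cgroupDriver /var/lib/kubelet/config.yaml

# Driver containerd's runc runtime is using
containerd config dump | grep SystemdCgroup
```

If they disagree, align them (on systemd-based distros, `systemd` for both is the usual recommendation) and restart containerd and kubelet on each affected node.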
What made you think it was a cgroup issue?
Wow, you nailed it! There was indeed a mismatch between cgroupfs and systemd on every node except the control plane. A quick Vim edit on each node and a kubelet restart sorted it. Appreciate the help!
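For anyone who finds this later, the edit was along these lines (assuming kubelet was on `cgroupfs` and the config lives at the kubeadm default path):

```
# Align kubelet's cgroup driver with containerd's systemd driver
# (assumes the file currently contains "cgroupDriver: cgroupfs")
sudo sed -i 's/^cgroupDriver: cgroupfs$/cgroupDriver: systemd/' /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
```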