I set up a Kubernetes cluster about 400 days ago with 3 control-plane nodes and 4 worker nodes. Recently, I added a 5th worker node and upgraded the whole setup from v1.30. Since then, I've been seeing random timeouts that cause hard-to-pin-down problems, especially with OpenSearch. Adding the node didn't seem complicated, but now I'm getting timeout warnings, VMs are struggling, and several Longhorn volumes are failing with 'context deadline exceeded'. I need some guidance on where to start troubleshooting and what specifically to investigate to get my cluster back on track.
4 Answers
It sounds like your new 5th node might not be set up correctly. I'd suggest comparing it against the existing workers and checking that every component on it is working as it should, right down to the MTU settings on the host. An MTU mismatch in particular can let small packets through while larger ones get dropped, which looks a lot like random timeouts. Give that a shot and see if anything stands out.
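To make the MTU comparison concrete, here is a minimal Python sketch. It assumes you have already collected the MTU of the relevant interface on each host (for example by reading /sys/class/net/<iface>/mtu); the node names and values below are made up for illustration, and `find_mtu_outliers` is just a helper name I picked.

```python
from collections import Counter


def find_mtu_outliers(mtu_by_node):
    """Return the nodes whose MTU differs from the most common value."""
    if not mtu_by_node:
        return {}
    # Treat the MTU shared by most nodes as the expected one.
    expected, _ = Counter(mtu_by_node.values()).most_common(1)[0]
    return {node: mtu for node, mtu in mtu_by_node.items() if mtu != expected}


# Hypothetical values; replace with what you read on each host.
mtus = {
    "worker-1": 1500,
    "worker-2": 1500,
    "worker-3": 1500,
    "worker-4": 1500,
    "worker-5": 9000,
}
print(find_mtu_outliers(mtus))
```

If the new node shows up as the only outlier, that is a strong hint about where to look first. Keep in mind that overlay networks usually need a pod-side MTU lower than the host MTU, so compare like with like.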
Check your LoadBalancer or VIP settings. I've faced similar problems when the VIP gets announced on multiple network interfaces, which breaks traffic routing and leads to timeouts when your services try to communicate. Also, verify your Longhorn replication settings: if volumes are set to replicate across all nodes, that can cause performance bottlenecks. Watch your Grafana metrics for spikes in CPU or network usage; those could give you clues about what's going wrong.
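A rough way to check the VIP part on a single host is to parse the one-line output of `ip -o -4 addr show` and list every interface that carries the VIP. This is only a sketch: the VIP, interface names, and addresses are invented, and `interfaces_holding` is a hypothetical helper, not part of any tool.

```python
def interfaces_holding(ip_addr_output, vip):
    """Return interfaces that carry `vip`, given `ip -o -4 addr show` output."""
    holders = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected one-line form: "<idx>: <iface> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == vip:
                holders.append(fields[1])
    return holders


# Hypothetical capture from one control-plane host.
sample = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.0.0.11/24 brd 10.0.0.255 scope global eth0
2: eth0    inet 10.0.0.100/32 scope global eth0
3: eth1    inet 10.0.0.100/32 scope global eth1
"""
print(interfaces_holding(sample, "10.0.0.100"))
```

More than one interface in the result on the same host, or the VIP showing up on several hosts at once, would match the situation described above.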
Can you share the latest dmesg and kubelet logs? It seems like there might be issues with your CNI or CoreDNS. Depending on which network plugin you run (Calico, for example), I've seen similar timeout problems. Checking the logs from your CNI and CoreDNS will help narrow down the issue.
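Before posting full logs, it can help to get a quick count of timeout-style messages per log source. Here is a small sketch that works on log lines you have already saved (for instance from `journalctl -u kubelet` on a systemd host). The pattern list is my assumption: 'context deadline exceeded' comes from your question, and "i/o timeout" is another common network error string; adjust it to what you actually see.

```python
from collections import Counter

# Assumed patterns; extend with whatever your logs actually contain.
TIMEOUT_PATTERNS = ("context deadline exceeded", "i/o timeout")


def count_timeouts(log_lines, patterns=TIMEOUT_PATTERNS):
    """Count how many log lines contain each timeout pattern."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for pattern in patterns:
            if pattern in lowered:
                counts[pattern] += 1
    return counts


# Hypothetical log lines for illustration only.
example_lines = [
    "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "dial tcp 10.0.0.5:443: i/o timeout",
    "Successfully pulled image",
    "volume attach failed: context deadline exceeded",
]
print(count_timeouts(example_lines))
```

Running this once per node and comparing the counts should show whether the timeouts cluster on the new worker or are spread across the whole cluster.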
Did you check for any duplicate IPs or overlapping Pod networks? It's a common mistake. Also, try temporarily shutting down the new 5th worker and see if that stabilizes things. If it does, you'll know that's where the issue lies.
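For the overlapping-network check, Python's standard `ipaddress` module can compare the ranges directly. A minimal sketch, assuming you fill in your real pod, service, and host CIDRs; the ranges and labels below are placeholders and `find_overlaps` is just a name I chose.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_cidrs):
    """Return pairs of names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Placeholder ranges; substitute the ones from your cluster and LAN.
ranges = {
    "pods": "10.244.0.0/16",
    "services": "10.96.0.0/12",
    "hosts": "10.244.5.0/24",
}
print(find_overlaps(ranges))
```

Any pair in the output means two networks claim the same addresses, which can produce exactly the kind of intermittent routing failures and timeouts you describe.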
