Hi everyone! I'm seeing a surprisingly high rate of TCP retransmissions in our Kubernetes cluster. node-exporter occasionally shows spikes of up to 3% retransmitted segments, and even the baseline sits around 0.5% to 1.5%, which feels excessive. Here's a quick rundown of our setup: dual-port 10 Gb NICs on each server, Cilium as the CNI, and Kubernetes 1.31.6+rke2r1.
In terms of performance, we ran a couple of tests using iperf3: from server to server, we achieved about 8.5 to 9.3 Gbps, and the pod-to-pod performance was around 5.0 to 7.2 Gbps. Both tests show similar numbers for retransmitted segments.
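For anyone who wants to reproduce the percentage outside of node-exporter, here is a rough Python sketch. It assumes the figure is the ratio of the kernel's `RetransSegs` to `OutSegs` counters from `/proc/net/snmp`, measured over an interval; the function names are made up for illustration, and the dashboard in question may compute it differently.

```python
import time
from pathlib import Path


def parse_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the two 'Tcp:' lines (header row, then value row) of /proc/net/snmp."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected exactly one Tcp header line and one Tcp value line")
    names, values = rows
    return dict(zip(names, (int(v) for v in values)))


def retransmit_percent(before: dict[str, int], after: dict[str, int]) -> float:
    """Retransmitted segments as a percentage of segments sent between two samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retrans / sent if sent > 0 else 0.0


def sample_retransmit_percent(interval_s: float = 10.0,
                              path: str = "/proc/net/snmp") -> float:
    """Take two samples interval_s seconds apart and return the retransmit percentage."""
    before = parse_tcp_counters(Path(path).read_text())
    time.sleep(interval_s)
    after = parse_tcp_counters(Path(path).read_text())
    return retransmit_percent(before, after)
```

Calling `sample_retransmit_percent(10)` on a host while iperf3 is running gives a host-wide figure that can be compared against what the dashboard reports. Note that these counters cover the network namespace the script runs in, so inside a pod they reflect only that pod's traffic.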
So my questions are:
1. Where should I dive deeper to find out why packets are dropping? Should I check the NICs, switches, Cilium configuration, or kernel settings?
2. Does the throughput I'm seeing seem normal given this hardware and CNI setup, or should I aim for better?
4 Answers
Have you looked at retransmissions across all your NICs? If just one or two are having problems, it could be something as simple as oxidized connections at the termination points. Also, consider how many servers you have and the bandwidth of your internal switches. I faced a similar situation where a low-cost switch bottlenecked the internal data flow, leading to packet loss.
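To compare NICs quickly, something like the following might help. It is a minimal sketch that reads the standard per-interface error and drop counters from `/sys/class/net/<iface>/statistics/` on Linux; the helper names are my own, and these counters show drops and errors at the interface level, not TCP retransmissions as such.

```python
from pathlib import Path

# Standard Linux per-interface counters exposed under sysfs.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_interface_counters(sysfs_root: str = "/sys/class/net") -> dict[str, dict[str, int]]:
    """Return {interface: {counter: value}} for every interface with a statistics dir."""
    result: dict[str, dict[str, int]] = {}
    for iface_dir in sorted(Path(sysfs_root).iterdir()):
        stats_dir = iface_dir / "statistics"
        if not stats_dir.is_dir():
            continue  # skip non-interface entries
        counters: dict[str, int] = {}
        for name in COUNTERS:
            counter_file = stats_dir / name
            if counter_file.is_file():
                counters[name] = int(counter_file.read_text().strip())
        result[iface_dir.name] = counters
    return result


def interfaces_with_problems(counters: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Keep only interfaces that have at least one non-zero error or drop counter."""
    return {
        iface: {name: value for name, value in values.items() if value > 0}
        for iface, values in counters.items()
        if any(value > 0 for value in values.values())
    }
```

Running `interfaces_with_problems(read_interface_counters())` on each node and comparing the output makes it easier to spot a single port or cable that stands out. The counters are cumulative since the interface came up, so take two readings some time apart if you care about the current rate.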
Also, be sure to check for any drops reported by Cilium itself. I had a similar issue recently that stemmed from a specific Cilium bug. It's definitely worth taking a look at any known issues on their GitHub page related to retransmissions!
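If you already scrape the Cilium agent's Prometheus metrics, one way to summarize its drops is sketched below. It assumes the metrics text has been saved to a string and that drops are exposed as `cilium_drop_count_total` with `reason` and `direction` labels; please verify the metric name against your Cilium version. The function name is hypothetical.

```python
import re
from collections import defaultdict

# One sample line in the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def drops_by_reason(metrics_text: str,
                    metric_name: str = "cilium_drop_count_total") -> dict[tuple[str, str], float]:
    """Sum one counter metric per (reason, direction) label pair.

    The default metric name is an assumption about the Cilium agent's metrics.
    """
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        key = (labels.get("reason", "unknown"), labels.get("direction", "unknown"))
        totals[key] += float(match.group("value"))
    return dict(totals)
```

Sorting the result by value shows which drop reasons dominate, which helps narrow down whether the loss happens in the datapath or somewhere below it.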
Since you're using Cilium, can you share your configuration? Information like your Cilium version, routing mode, and tunneling settings can be crucial for diagnosing these retransmission issues. Each of these variables can influence networking performance.
What kernel version are your hosts running? If they're VMs, it’s also worth noting which hypervisor you’re using. Both can affect networking significantly, so it’s worth checking the hypervisor's own network metrics as well.