Hi everyone! I'm facing a surprisingly high rate of TCP retransmissions in our Kubernetes cluster. node-exporter occasionally shows spikes of up to 3% retransmitted segments, and even the baseline hovers around 0.5% to 1.5%, which feels excessive. Here's a quick rundown of our setup: dual-port 10 Gb NICs on each server, Cilium as the CNI, and Kubernetes 1.31.6+rke2r1.
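In case it's useful, this is how I'm sanity-checking the ratio directly on a node; it reads the same kernel counters node-exporter scrapes (nstat ships with iproute2):

```bash
# Absolute TCP counters since boot. These are the same values node-exporter
# exposes as node_netstat_Tcp_RetransSegs / node_netstat_Tcp_OutSegs
nstat -az TcpRetransSegs TcpOutSegs

# Delta over a 10-second window: set a baseline, wait, read again
nstat >/dev/null; sleep 10; nstat TcpRetransSegs TcpOutSegs
```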
In terms of performance, we ran a couple of iperf3 tests: server to server we get about 8.5 to 9.3 Gbps, and pod to pod around 5.0 to 7.2 Gbps. Both tests show similar retransmission counts.
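For reference, the tests looked roughly like this (IPs are placeholders; the Retr column is where the retransmit counts come from):

```bash
# Receiver side (on the target node, or inside the target pod)
iperf3 -s

# Sender side: 30-second run, 4 parallel streams; the "Retr" column
# reports retransmitted segments per stream
iperf3 -c 10.0.0.2 -t 30 -P 4
```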
So my questions are:
1. Where should I dive deeper to find out why packets are dropping? Should I check the NICs, switches, Cilium configuration, or kernel settings?
2. Does the throughput I'm seeing seem normal given this hardware and CNI setup, or should I aim for better?
5 Answers
First things first, have you checked your MTU settings? MTU mismatches are a common culprit for packet drops. You can test the path by pinging with the don't-fragment flag at increasing payload sizes to see where fragmentation begins. If you're running an overlay like VXLAN, make sure your physical NICs are configured with enough headroom for the encapsulation overhead.
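For example (assuming a standard 1500-byte path; the ICMP payload is the MTU minus 28 bytes of IP and ICMP headers, and the addresses are placeholders):

```bash
# -M do sets the Don't Fragment bit; -s is the ICMP payload size.
# 1472 + 28 bytes of headers = 1500, so this should succeed on a 1500 MTU path
ping -M do -s 1472 -c 3 10.0.0.2

# This should fail with "message too long" on a 1500 MTU path
ping -M do -s 1473 -c 3 10.0.0.2

# VXLAN encapsulation adds ~50 bytes, so pod-to-pod paths effectively
# top out around a 1450 MTU (payload 1422)
ping -M do -s 1422 -c 3 <pod-ip>
```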
Have you compared retransmissions across all your NICs? If only one or two are misbehaving, it could be something as simple as oxidized contacts at the connectors. Also weigh the number of servers against the bandwidth of your internal switches: I once saw a low-cost switch bottleneck internal traffic and cause exactly this kind of packet loss.
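A quick way to compare per-NIC counters (interface names are examples; substitute your own):

```bash
# Kernel-level RX/TX statistics per interface, including drops and errors
ip -s link show

# Driver/firmware counters; CRC errors in particular often point at
# bad cabling or oxidized connectors
for nic in eth0 eth1; do
  echo "== $nic =="
  ethtool -S "$nic" | grep -iE 'err|drop|crc|miss'
done
```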
Also, be sure to check for any drops reported by Cilium itself. I recently hit a similar issue that turned out to be a specific Cilium bug, so it's worth scanning their GitHub issues for anything related to retransmissions!
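Something like this should surface the drops (assuming the usual DaemonSet install in kube-system; newer releases name the in-agent CLI cilium-dbg instead of cilium):

```bash
# Live view of packets Cilium drops, including the drop reason
kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop

# If Hubble is enabled, the same information with full flow context
hubble observe --verdict DROPPED
```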
Since you're using Cilium, can you share your configuration? Information like your Cilium version, routing mode, and tunneling settings can be crucial for diagnosing these retransmission issues. Each of these variables can influence networking performance.
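If it helps, these usually capture the relevant details (again assuming the standard kube-system install):

```bash
# Agent status: version, routing mode, tunnel protocol, datapath health
kubectl -n kube-system exec ds/cilium -- cilium status

# The configuration Cilium was deployed with
kubectl -n kube-system get configmap cilium-config -o yaml \
  | grep -iE 'tunnel|routing-mode|mtu|datapath'
```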
What kernel version are your hosts running? If they're VMs, it's also worth noting which hypervisor you're using. Both can affect networking significantly, and it wouldn't hurt to dig into the hypervisor's network performance metrics too.
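Quick ways to collect that (eth0 is a placeholder):

```bash
# Kernel version on each node
uname -r

# Reports whether the host is a VM and which hypervisor (kvm, vmware, ...)
systemd-detect-virt

# Offload settings worth eyeballing on virtualized NICs
ethtool -k eth0 | grep -E 'segmentation|offload'
```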
Absolutely! And if you're using an overlay like Flannel or VXLAN, remember that the encapsulation headers eat into your effective MTU. If your MTU settings turn out to be off, adjust them consistently across the whole path (NICs, switches, and the CNI) so every hop agrees.
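To double-check the MTU actually in effect at each layer, something like this works (interface and pod names are placeholders; Cilium typically creates a cilium_host device):

```bash
# MTU on the physical NIC and on Cilium's host device
ip link show eth0 | grep mtu
ip link show cilium_host | grep mtu

# MTU as seen from inside a pod
kubectl exec <some-pod> -- cat /sys/class/net/eth0/mtu
```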