Diagnosing High TCP Retransmissions in Kubernetes: Where to Look?

Asked By TechScribe99 On

Hi everyone! I'm currently facing an issue with a surprisingly high rate of TCP retransmissions in our Kubernetes cluster. Our node-exporter occasionally shows spikes of up to 3% retransmitted segments, and even the baseline rate hangs around 0.5% to 1.5%, which feels excessive. Here's a quick rundown of our setup: we've got dual-port 10 Gb NICs on each server, a Cilium networking setup, and our K8s version is 1.31.6+rke2r1.
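For anyone who wants to reproduce the percentage outside of node-exporter: the figure is usually derived from the kernel's `RetransSegs` and `OutSegs` TCP counters in `/proc/net/snmp`. The sketch below is one way to compute it over a sampling window. It assumes a Linux host, and the function names (`parse_tcp_counters`, `retransmit_ratio`, `measure`) are made up for illustration. The exact query behind the dashboard in the question is not known, so treat the result as an approximation.

```python
import time


def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    # First "Tcp:" line holds the field names, the second holds the values.
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(header, values)}


def retransmit_ratio(before, after):
    """Retransmitted segments divided by sent segments between two samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def measure(interval_s=10.0, path="/proc/net/snmp"):
    """Sample the counters twice and return the ratio for that window."""
    with open(path) as f:
        before = parse_tcp_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_tcp_counters(f.read())
    return retransmit_ratio(before, after)
```

Calling `measure(30)` on a node while iperf3 is running gives the ratio for that 30-second window, which can then be compared between server-to-server and pod-to-pod runs.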

In terms of performance, we ran a couple of tests using iperf3: from server to server, we achieved about 8.5 to 9.3 Gbps, and the pod-to-pod performance was around 5.0 to 7.2 Gbps. Both tests show similar numbers for retransmitted segments.

So my questions are:
1. Where should I dive deeper to find out why packets are dropping? Should I check the NICs, switches, Cilium configuration, or kernel settings?
2. Does the throughput I'm seeing seem normal given this hardware and CNI setup, or should I aim for better?

4 Answers

Answered By CgroupGuru44 On

Have you looked at retransmissions across all your NICs? If just one or two are having problems, it could be something as simple as oxidized connections at the termination points. Also, consider how many servers you have and the bandwidth of your internal switches. I faced a similar situation where a low-cost switch bottlenecked the internal data flow, leading to packet loss.
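TCP retransmissions are counted per host rather than per NIC, so a practical stand-in for the per-NIC check is to compare each interface's error and drop counters. Here is a minimal sketch that reads them from `/proc/net/dev`, assuming a Linux host. The helper names are hypothetical, and it only covers the generic kernel counters, not the driver-specific ones that `ethtool -S` reports.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev-style text into per-interface error/drop counters."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        fields = [int(x) for x in rest.split()]
        if len(fields) < 16:
            continue
        # Fields 0-7 are receive counters, fields 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_packets": fields[1],
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_packets": fields[9],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def suspicious_interfaces(stats):
    """Return the names of interfaces with any non-zero error or drop counter."""
    keys = ("rx_errs", "rx_drop", "tx_errs", "tx_drop")
    return sorted(name for name, s in stats.items()
                  if any(s[k] for k in keys))


def read_net_dev(path="/proc/net/dev"):
    """Read and parse the live counters on the current host."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

Running `suspicious_interfaces(read_net_dev())` on every node and comparing the results should show whether the problem follows one port, one server, or the whole fabric. The counters are cumulative since boot, so take two samples and compare them rather than reading a single snapshot.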

Answered By CloudWatcher99 On

Also, be sure to check for any drops reported by Cilium itself. I recently had a similar issue that turned out to be caused by a specific Cilium bug. It's worth searching the issues in the Cilium GitHub repository for reports related to retransmissions.
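One way to look at Cilium's own drop counters is through the agent's Prometheus metrics, if they are enabled. The sketch below sums drops by reason from metrics text that has already been scraped. It assumes the counter is named `cilium_drop_count_total` and carries a `reason` label, which should be checked against the metrics reference for your Cilium version. The function name is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed metric name; confirm it against your Cilium version's metrics list.
SAMPLE_RE = re.compile(
    r'^cilium_drop_count_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def drops_by_reason(metrics_text):
    """Sum drop counters per 'reason' label, largest total first."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips comments and unrelated metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("reason", "unknown")] += float(match.group("value"))
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

If one reason dominates and keeps growing while iperf3 runs, that narrows the search to the datapath. If the counters stay flat while retransmissions climb, the NICs and switches become the more likely suspects.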

Answered By CiliumConsultant77 On

Since you're using Cilium, can you share your configuration? Information like your Cilium version, routing mode, and tunneling settings can be crucial for diagnosing these retransmission issues. Each of these variables can influence networking performance.

Answered By KernelWhisperer88 On

What kernel version are your hosts running? If they're VMs, it’s also worth noting which hypervisor you’re using. Both can affect network performance significantly, so it may help to look at the host-level and hypervisor-level network metrics as well.
