Hey everyone, I'm dealing with a frustrating situation with our EPYC Gen 4 processors. We've got three Dell PE 7625 servers, each with dual AMD 9374F processors (32 cores) and 512GB RAM. Unfortunately, we're experiencing some pretty terrible bandwidth issues between VMs and from VMs to the host node on the same server. Specifically, we're only seeing around 13 Gbps from host to VM and about 8 Gbps VM to VM, despite having a 50 Gbps bridge setup with two 25 Gbps ports bonded using LACP and no other traffic going through.
I've tried several measures to fix this but haven't had any luck:
1. Enabled multiqueue in Proxmox with 8 queues, but it didn't help.
2. Set NPS (NUMA nodes per socket) in the BIOS to 4 and then to 2, but again, no change.
3. I ran an iperf test between VMs on an older Intel cluster, and it achieved around 30 Gbps, while my AMD setup only managed 38 Gbps in comparison, which is pretty disappointing given the specs.
I went ahead and tested the same Proxmox version on another server with an Intel Xeon 5410, and it performed at 68 Gbps under identical conditions. I'm not sure why there's such a difference, and I'm looking for any insight on how to boost the inter-core/inter-process bandwidth to maximize throughput. If this keeps up, I'm worried about recommending AMD processors for virtualization to future buyers.
Any suggestions or insights would be really appreciated! I've even tried different operating systems like Red Hat and Debian, but performance hasn't improved. Only with Ubuntu 22 did I briefly see 50 Gbps, and that dropped again after an upgrade. Thanks!
2 Answers
I think your issue might be due to the network stack being involved even though you're not actually using the network cards for the test. To get a better sense of the true bandwidth between cores, you should try using Intel's Memory Latency Checker. It could give you clearer insights into the performance across your cores.
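To compare your iperf numbers with whatever MLC reports, it helps to put them in the same units first. Here is a minimal sketch, assuming MLC prints bandwidth in MB/s with 1 MB = 1,000,000 bytes (worth confirming against the header of its output). The function name is made up for illustration.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second.

    Assumes decimal units: 1 Gbit = 1e9 bits, 1 MB = 1e6 bytes.
    """
    bits_per_second = gbps * 1e9
    bytes_per_second = bits_per_second / 8
    return bytes_per_second / 1e6


if __name__ == "__main__":
    # Figures taken from the question: measured rates and the bond's nominal capacity.
    for label, gbps in [("host to VM", 13), ("VM to VM", 8), ("bond capacity", 50)]:
        print(f"{label}: {gbps} Gbps = {gbps_to_mb_per_s(gbps):.0f} MB/s")
```

That puts 13 Gbps at 1625 MB/s and 8 Gbps at 1000 MB/s. If MLC shows core-to-core or memory bandwidth far above that, the bottleneck is more likely in the virtual network path than in the hardware itself.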
It sounds like you might be dealing with a NUMA (Non-Uniform Memory Access) issue. Check to ensure that the core and RAM assignments align properly with your server’s architecture. Sometimes mismatches can lead to performance drops like what you’re experiencing.
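A rough way to check the alignment on a Linux host is to read the per-node CPU lists from sysfs and see how many NUMA nodes a VM's CPUs touch. This is only a sketch: it assumes the standard `/sys/devices/system/node/nodeN/cpulist` layout, and the set of VM CPUs in the example is hypothetical, so substitute the host CPUs your VM is actually pinned to. The helper names are my own.

```python
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as '0-3,8' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, set[int]]:
    """Return {node_id: cpu_ids} from sysfs, or an empty dict if the path is absent."""
    root = Path(sysfs_root)
    nodes: dict[int, set[int]] = {}
    if not root.is_dir():
        return nodes
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return nodes


def nodes_spanned(vm_cpus: set[int], nodes: dict[int, set[int]]) -> list[int]:
    """List the NUMA nodes that contain at least one of the VM's CPUs."""
    return sorted(node for node, cpus in nodes.items() if cpus & vm_cpus)


if __name__ == "__main__":
    layout = read_numa_nodes()
    for node, cpus in sorted(layout.items()):
        print(f"node {node}: {len(cpus)} CPUs")
    vm_cpus = {0, 1, 2, 3, 4, 5, 6, 7}  # hypothetical pinning, replace with your own
    print("VM CPUs span NUMA nodes:", nodes_spanned(vm_cpus, layout))
```

If a VM's CPUs span more than one node, or its memory sits on a different node than its CPUs, traffic between VMs has to cross the fabric between nodes, which is the kind of mismatch I mean.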
That’s a good point. I’ll check out the Intel MLC, but it’s puzzling why my AMD setup is lagging behind an entry-level Intel chip. Any ideas?