I'm dealing with a latency-sensitive application that relies heavily on GPU compute, but I'm seeing inconsistent performance with our AWS GPU cloud setup, particularly latency spikes causing a bottleneck. An AWS Enterprise package representative suggested we look into bare metal servers for better control and reduced latency. Before making a switch, I'd love to get insights on the following:

1. What adjustments or optimizations can we try within AWS to reduce GPU compute latency?
2. Are there any AWS-native tweaks (like placement groups or enhanced networking) that are effective for low-latency GPU workloads?
3. What are the pros and cons of using bare metal for this type of work, based on your experience?
4. Are there any hybrid solutions (combining AWS with bare metal colo) worth considering?
2 Answers
I’ve been in the same boat with AWS GPU instances. They're awesome for scaling, but if you're after consistent low latency, they can be tricky. Here’s what I'd recommend trying:
- Look into the newer instance families (like P5, P4d, or G6e). Where a bare-metal variant exists (e.g. g4dn.metal), it removes hypervisor jitter entirely; the boto3 sketch after this list shows one way to find the bare-metal GPU types.
- Enable Elastic Fabric Adapter (EFA) and place your nodes in a cluster placement group; this helps significantly with inter-node latency (the same sketch below covers both).
- Pin your GPU processes to CPUs on the GPU's NUMA node (a pinning sketch follows the list), disable CPU power-saving features (C-states, frequency scaling), and make sure enhanced networking via the Elastic Network Adapter (ENA) is enabled to help minimize latency spikes.
- Keep hot data on local NVMe instance storage, or use FSx for Lustre, rather than reading from S3 or EBS on the latency-critical path.
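Since the first two points are scriptable, here's a minimal boto3 sketch that lists GPU instance types with a bare-metal variant, creates a cluster placement group, and launches an EFA-enabled instance. The AMI, subnet, and security group IDs are placeholders, so treat this as a starting point, not production code:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Find bare-metal instance types that carry GPUs.
paginator = ec2.get_paginator("describe_instance_types")
for page in paginator.paginate(Filters=[{"Name": "bare-metal", "Values": ["true"]}]):
    for itype in page["InstanceTypes"]:
        if "GpuInfo" in itype:
            print(itype["InstanceType"], itype["GpuInfo"]["Gpus"])

# 2. A cluster placement group keeps nodes on the same low-latency network fabric.
ec2.create_placement_group(GroupName="gpu-cluster-pg", Strategy="cluster")

# 3. Launch with an EFA interface attached (placeholder IDs -- replace with yours).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # hypothetical Deep Learning AMI ID
    InstanceType="p4d.24xlarge",
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "gpu-cluster-pg"},
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",            # requires an EFA-capable instance type
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
    }],
)
```

Keep in mind EFA only pays off for inter-node traffic (e.g. NCCL collectives); a single-node job won't see a difference from it.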
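For the pinning point: on Linux you can read which NUMA node a GPU hangs off via sysfs and pin the current process to the CPUs on that node. A stdlib-only sketch; the PCI address is a placeholder for your GPU's (find the real one with `nvidia-smi --query-gpu=pci.bus_id --format=csv`):

```python
import os

# Placeholder PCI address -- substitute your GPU's bus ID from nvidia-smi.
GPU_PCI_ADDR = "0000:10:1c.0"

def cpus_on_gpu_node(pci_addr: str) -> set[int]:
    """Return the CPU IDs local to the GPU's NUMA node (Linux sysfs)."""
    with open(f"/sys/bus/pci/devices/{pci_addr}/numa_node") as f:
        node = int(f.read().strip())
    if node < 0:  # single-node systems report -1; fall back to all CPUs
        return set(range(os.cpu_count()))
    cpus = set()
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        for part in f.read().strip().split(","):  # format like "0-23,48-71"
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
    return cpus

# Pin this process (pid 0 = self) so data-loading threads stay NUMA-local to the GPU.
os.sched_setaffinity(0, cpus_on_gpu_node(GPU_PCI_ADDR))
```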
If those tweaks don’t work, switching to bare metal could offer more predictable performance, but you’ll lose some flexibility and have to manage hardware issues.
Have you actually pinpointed where the latency comes from? Before switching platforms, profile the workload to find the bottleneck; it matters a lot whether the spikes are in GPU compute, the input pipeline, or the network, because that changes which of the tweaks above will help. A quick way to measure the GPU side follows below.
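If the workload is PyTorch-based (an assumption on my part), CUDA events give you per-iteration GPU latency without CPU-timer noise, and the tail percentiles tell you whether the spikes live on the GPU at all. A minimal sketch:

```python
import torch

def time_gpu_step(step_fn, iters=200, warmup=20):
    """Time a GPU step with CUDA events; return per-iteration latencies in ms."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for i in range(warmup + iters):
        start.record()
        step_fn()
        end.record()
        torch.cuda.synchronize()      # wait so elapsed_time() is valid
        if i >= warmup:
            times.append(start.elapsed_time(end))
    return times

# Example: time a dummy matmul; swap in your real inference/training step.
x = torch.randn(4096, 4096, device="cuda")
lat = time_gpu_step(lambda: x @ x)
lat.sort()
print(f"p50={lat[len(lat) // 2]:.2f} ms  p99={lat[int(len(lat) * 0.99)]:.2f} ms")
```

If p99 is flat here but your end-to-end latency still spikes, the problem is upstream of the GPU (input pipeline or network), and bare metal alone won't fix it.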
Totally agree with this. I'm going with Equinix bare metal for exactly that reason!