Exploring GPU Multi-Tenancy with MIG on H100: Experiences and Advice

Asked By DigitalNomad43 On

Hey everyone! After months of wrestling with GPU resource contention in our Kubernetes cluster, I finally set up NVIDIA's Multi-Instance GPU (MIG) on our H100s, and it's been a game changer. With MIG, one H100 can be split into up to seven isolated GPU instances, each with its own dedicated memory and compute resources. That opens up a lot of possibilities, like running a Jupyter notebook, ML training jobs, and multiple inference services on the same physical GPU without them interfering with each other.

The K8s integration through the GPU Operator is pretty smooth: MIG instances show up as schedulable resources, and workloads are placed on them based on their resource requests. I also put together a complete implementation guide, since I noticed a lack of K8s-specific MIG documentation out there.

If you're running GPU workloads in K8s, this could make a big difference to your resource utilization. What's been your biggest struggle with GPU resource management? Have any of you tried MIG in production?
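For anyone wondering what "scheduling based on resource requests" looks like in practice, here is a minimal sketch that builds a Pod manifest asking for one MIG instance. It assumes the device plugin exposes each profile as its own resource named `nvidia.com/mig-<profile>` (what the GPU Operator does under its "mixed" MIG strategy). The profile `1g.10gb`, the pod name, and the image are placeholders rather than anything from the setup described above, so check which profiles your nodes actually advertise.

```python
import json


def mig_pod_manifest(name, image, mig_profile="1g.10gb", count=1):
    """Build a Pod manifest that requests `count` MIG instances of one profile.

    Assumption: the cluster advertises MIG instances as extended resources
    named nvidia.com/mig-<profile>. The default profile is only an example.
    """
    resource = f"nvidia.com/mig-{mig_profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are requested through limits.
                    "resources": {"limits": {resource: count}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = mig_pod_manifest("notebook", "example.invalid/jupyter:latest")
print(json.dumps(manifest, indent=2))
```

The output is JSON, which `kubectl apply -f -` accepts just like YAML, so it can be piped straight into the cluster for a quick test.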

5 Answers

Answered By GPUExplorer23 On

It's great to hear about your findings! MIG's memory allocations can indeed be misleading: the memory figure in a profile name isn't always what's actually usable. Getting only 10.75Gi from a nominal 12Gi allocation is bound to confuse new users. Hopefully NVIDIA addresses this in the future. Have you considered how this might affect your application requirements?
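To make the planning impact concrete, here is a small sketch of the arithmetic using the 12Gi and 10.75Gi figures mentioned above. The helper names and the 5% safety headroom are assumptions chosen for illustration, not a recommendation from NVIDIA.

```python
def usable_fraction(advertised_gib, usable_gib):
    """Share of the advertised profile memory that is actually usable."""
    return usable_gib / advertised_gib


def fits(required_gib, usable_gib, headroom=0.05):
    """True if a workload fits with some safety headroom left over.

    `headroom` is an assumed margin (5% here) for allocator overhead.
    """
    return required_gib * (1 + headroom) <= usable_gib


advertised, usable = 12.0, 10.75  # figures quoted in this thread
print(f"{usable_fraction(advertised, usable):.1%} of the advertised memory is usable")
print("10.0 GiB workload fits:", fits(10.0, usable))
print("10.5 GiB workload fits:", fits(10.5, usable))
```

With these numbers roughly a tenth of the advertised memory is unavailable, so a workload sized against the profile name rather than the measured figure can fail to fit.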

MLMaven92 -

Absolutely! Clarity on memory usage would certainly help with planning. When you first set it up, did you notice discrepancies too, or was it pretty straightforward?

Answered By OpenSourceFan49 On

I saw your post about using the NVIDIA GPU Operator in K8s. Have you tried pairing that with Talos Linux? I'm curious about the stability issues you mentioned—did a fresh install help? What extensions are essential to get things running smoothly?

SynergySeeker12 -

We did test with Talos, but faced some challenges that prompted a switch to Ubuntu. I'd love to hear if anyone else has had a better experience with Talos using H100s!

Answered By ComputeWiz66 On

MIG is indeed not new, but it's exciting to see its implementation! Time-slicing has been a workaround for a while, especially where MIG isn't supported, but you lose the isolation aspect. I get that static configurations can lead to wasted resources. Have you considered alternatives like RunAI for more dynamic resource management? It might be something worth looking into, especially if flexibility is key for your setup.

CulturedCoder77 -

Fair point! I'd really like to understand how RunAI stacks up against MIG for folks who need more dynamic allocation. Have you had any experience comparing the two?

Answered By CuriousCoder08 On

Have you had a chance to experiment with different MIG configurations, changing the profile shapes dynamically? Last I heard, you had to reboot the host to make those changes, which can be a hassle. It's something I’ve been researching, and I'd love to know your take on it!

DataDynamo45 -

Unfortunately, you still do need to restart the host for those changes. I've been doing the same—keeping multiple MIG layouts handy to minimize restarts. If you find a workaround, let me know!
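If you keep several layouts on hand, it can help to sanity-check each one before applying it. The sketch below only checks a slice budget, assuming an 80 GB H100 with 7 compute slices and 8 memory slices and the per-profile costs in the table. The layout names are made up, other memory variants use different profile names, and the real placement rules are stricter than a simple sum, so treat a pass here as necessary but not sufficient.

```python
# (compute slices, memory slices) per profile. Assumed for an 80 GB H100;
# adjust the table for other GPU variants.
PROFILE_SLICES = {
    "1g.10gb": (1, 1),
    "1g.20gb": (1, 2),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE, MAX_MEMORY = 7, 8


def slice_usage(layout):
    """Return (compute, memory) slices used by a {profile: count} layout."""
    compute = sum(PROFILE_SLICES[p][0] * n for p, n in layout.items())
    memory = sum(PROFILE_SLICES[p][1] * n for p, n in layout.items())
    return compute, memory


def within_budget(layout):
    """Check the slice budget only; it does not model placement rules."""
    compute, memory = slice_usage(layout)
    return compute <= MAX_COMPUTE and memory <= MAX_MEMORY


# Hypothetical named layouts to switch between.
LAYOUTS = {
    "all-small": {"1g.10gb": 7},
    "balanced": {"3g.40gb": 1, "2g.20gb": 1, "1g.10gb": 2},
    "inference-plus-training": {"4g.40gb": 1, "1g.10gb": 3},
}

for name, layout in LAYOUTS.items():
    print(name, slice_usage(layout), "ok" if within_budget(layout) else "over budget")
```

Keeping the layouts in one table like this also makes it easier to see at a glance which ones leave slices unused.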

Answered By TechieGuru89 On

It's awesome to hear about your experiences with MIG! This is a hot topic in the K8s community. I totally get the concerns about shared bandwidth and other potential drawbacks, and I've heard mixed feedback about MIG too. It's worth keeping in mind that while MIG isolates workloads, instances can still share bandwidth on some paths (the PCIe link to the host, for example), which can be a real issue for bandwidth-intensive tasks. Have you run into any performance hiccups? What's your opinion on the trade-offs? Also, thanks for sharing your blog post; I'm definitely checking it out!

NerdyNinja301 -

You nailed it! Bandwidth management is a key consideration. I'm also curious about the potential overhead with splitting resources. With mixed workloads, how has the performance been? Are there particular workloads that you've found MIG excels at?
