How Can I Track GPU Utilization Per Pod with NVIDIA Time-Slicing?

Asked By TechWanderer42 On

Hey there! I'm developing a cost-optimization tool for Kubernetes, and I've hit a snag. One of my customers runs inference workloads and uses the NVIDIA device plugin's time-slicing to split their NVIDIA A10s into 5 virtual GPUs. The trouble is, they have no visibility into which pod is actually using the VRAM and compute on those slices. I'd like to integrate a solution for them without writing a custom monitoring agent from scratch.

I've checked out a couple of options:

1. **NVIDIA dcgm-exporter:** It seems to be the go-to, but I've heard that mapping the metrics back to specific pods can get complicated or even break down with time-slicing.
2. **Kepler (eBPF):** It seems really robust for tracking power and usage by process ID, but it might be more than I need.

For those of you utilizing virtual GPUs or time-slicing in production, how are you managing to get accurate per-pod VRAM and utilization metrics? Do you just have them deploy the GPU Operator and pull data from Prometheus, or is there a better lightweight solution available? Any insights would be greatly appreciated!

3 Answers

Answered By CuriousCoder99 On

I've been in a similar situation, and what worked best for us was getting the GPU Operator metrics integrated into our monitoring solution. It gives decent visibility without too much hassle. You might want to look into it!
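For anyone who wants to try this route, here is a minimal sketch of pulling dcgm-exporter data out of Prometheus and grouping it by GPU. Treat it as a starting point under assumptions, not a verified recipe: the Prometheus address is hypothetical, and the label names (`UUID`, `pod`, `namespace`, and the `exported_*` variants Prometheus creates on label conflicts) depend on how the exporter and scrape config are set up. `DCGM_FI_DEV_FB_USED` is a device-level metric, so at best this shows which pods share a GPU and how full that GPU is, not each pod's own share.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

# Assumption: hypothetical in-cluster address; replace with your own.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Framebuffer (VRAM) in use on the whole device, reported in MiB.
QUERY = "DCGM_FI_DEV_FB_USED"


def fetch_instant(query, base_url=PROMETHEUS_URL):
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def pods_by_gpu(response):
    """Group (namespace, pod) pairs by GPU UUID, keeping the device-level value."""
    gpus = defaultdict(lambda: {"value": None, "pods": set()})
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        gpu = labels.get("UUID", "unknown")
        gpus[gpu]["value"] = float(sample["value"][1])
        # If Prometheus renamed conflicting labels, the workload pod is in
        # exported_pod and "pod" is the exporter itself, so check that first.
        pod = labels.get("exported_pod") or labels.get("pod")
        if pod:
            namespace = labels.get("exported_namespace") or labels.get("namespace", "")
            gpus[gpu]["pods"].add((namespace, pod))
    return dict(gpus)


if __name__ == "__main__":
    for gpu, info in pods_by_gpu(fetch_instant(QUERY)).items():
        print(gpu, f"{info['value']:.0f} MiB used, shared by", sorted(info["pods"]))
```

If the pod labels come back empty or identical for every slice, that is the mapping problem the question describes, and a per-process approach on the node is probably needed instead.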

Answered By GPUWhisperer88 On

I'm glad to see someone talking about this! I usually just monitor the processes running directly on the GPU from the host system. It's not the cleanest way, but it works for managing usage. Just not sure how you'd scale that approach effectively.
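A rough sketch of how the host-side approach could be tied back to pods, in case it helps: list the compute processes with `nvidia-smi`, then read each PID's `/proc/<pid>/cgroup` to recover the pod UID. The assumptions are loud ones: it has to run on the node (or in a pod sharing the host PID namespace) so that `nvidia-smi` can see the PIDs, the cgroup paths must follow the usual `kubepods` layout, and the helper names are made up for illustration. It only covers VRAM per process, not compute utilization.

```python
import re
import subprocess
from collections import defaultdict

# Matches the pod UID in both cgroupfs ("pod<uid>") and systemd
# ("pod<uid_with_underscores>.slice") cgroup path layouts.
POD_UID_RE = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows into (pid, MiB or None) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        # Memory can be reported as "[N/A]"; keep the row but mark it unknown.
        mib = int(parts[1]) if parts[1].isdigit() else None
        rows.append((int(parts[0]), mib))
    return rows


def pod_uid_from_cgroup(cgroup_text):
    """Return the pod UID found in a /proc/<pid>/cgroup dump, or None."""
    match = POD_UID_RE.search(cgroup_text)
    return match.group(1).replace("_", "-") if match else None


def vram_by_pod_uid():
    """Sum per-process GPU memory (MiB) by pod UID on this node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    totals = defaultdict(int)
    for pid, mib in parse_compute_apps(out):
        try:
            with open(f"/proc/{pid}/cgroup") as f:
                uid = pod_uid_from_cgroup(f.read())
        except OSError:
            continue  # process exited or is unreadable
        if uid and mib is not None:
            totals[uid] += mib
    return dict(totals)


if __name__ == "__main__":
    print(vram_by_pod_uid())
```

The UIDs can be matched to pod names through each pod's `metadata.uid`. To scale it, something like this would have to run on every GPU node and export its totals, at which point it starts to look like the custom agent the question was hoping to avoid.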

Answered By DataDigger24 On

Same problem here! If you find a solution that works, definitely share the update!
