I'm really impressed with the NVIDIA GPU Operator! It has taken a lot of work off our team when it comes to managing GPU drivers, CUDA versions, and the container toolkit on each node. However, I haven't upgraded any drivers yet and I'm looking for advice from the community. Any recommendations or tricks for driver upgrades with this operator? Thanks a lot!
3 Answers
Just curious, are you running a self-hosted Kubernetes cluster or are you using a cloud provider? It can change the way you approach driver management!
Honestly, I'm a bit hesitant about letting the operator handle all driver installations and load kernel modules (modprobe) on live nodes. I come from a more traditional background, so I prefer managing some of this setup at the OS level and just allocating the resources to Kubernetes as needed. I might consider disabling certain features of the operator to keep more control.
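For anyone going the same route, here is a rough sketch of what "disabling certain features" could look like. It only builds a `helm install` command line and does not run it. The `driver.enabled` and `toolkit.enabled` chart values come from the GPU Operator docs for nodes with pre-installed drivers or container toolkit; the release name, namespace, and the `nvidia` repo alias are assumptions, so check them against your own setup and chart version.

```python
import shlex
from typing import List


def helm_install_args(preinstalled_driver: bool, preinstalled_toolkit: bool) -> List[str]:
    """Build a `helm install` argument list for the GPU Operator chart.

    Release name, namespace, and repo alias below are assumed values.
    """
    args = [
        "helm", "install", "gpu-operator", "nvidia/gpu-operator",
        "--namespace", "gpu-operator", "--create-namespace",
    ]
    if preinstalled_driver:
        # Driver is managed at the OS level, so the operator should not deploy it.
        args += ["--set", "driver.enabled=false"]
    if preinstalled_toolkit:
        # Container toolkit is already installed on the host.
        args += ["--set", "toolkit.enabled=false"]
    return args


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(helm_install_args(preinstalled_driver=True, preinstalled_toolkit=False)))
```

With the driver disabled like this, driver upgrades stay in your OS tooling (Ansible, packages, etc.) and the operator only handles the remaining components.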
I used Ansible to manage my first cluster, but now I prefer using an operator for task automation. The MIG feature seems like a game changer, but unfortunately, my current GPUs don't support it.
It really depends on your compliance requirements and threat model. If they're strict, managing the drivers manually might be the better route.
Before you dive into upgrades, make sure your support contract is current. We've faced a lot of bugs with new DGX systems, but those issues seem to be clearing up with the latest editions. My last two upgrades went smoothly!
It's a self-hosted bare metal setup.