Need Help with AKS Nvidia GPU Operator Installation Issues

Asked By TechieMcGadget

Hey everyone! I'm following the NVIDIA AI Enterprise deployment guide for Azure AKS and I'm stuck on the final step. The command I'm running is: `helm install gpu-operator nvaie/gpu-operator-4-0 --version 23.6.1 --set driver.repository=nvcr.io/nvaie,driver.licensingConfig.configMapName=licensing-config --namespace gpu-operator`. The install doesn't complete successfully and I'm getting some strange errors. If anyone has experience with this, I'd really appreciate tips on what to check during the installation!
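For anyone following along: the command above targets the `gpu-operator` namespace and the `nvaie` chart repo, so both need to exist before the install can succeed. A minimal pre-flight sketch is below. The function name `preflight_gpu_operator` is made up for illustration, and it assumes `kubectl` and `helm` are already pointed at the AKS cluster.

```shell
# Sketch only: checks two prerequisites of the helm install command above.
# preflight_gpu_operator is a hypothetical helper name, not part of any tool.
preflight_gpu_operator() {
  ns="${1:-gpu-operator}"
  repo="${2:-nvaie}"

  # helm install --namespace does not create the namespace by itself.
  if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace $ns not found; create it or add --create-namespace to helm install"
    return 1
  fi

  # The chart reference nvaie/... only resolves if that repo has been added.
  if ! helm repo list 2>/dev/null | grep -q "^${repo}[[:space:]]"; then
    echo "helm repo $repo not configured"
    return 1
  fi

  echo "preflight ok"
}
```

Calling `preflight_gpu_operator` with no arguments uses the names from the command above; pass a different namespace or repo name as the first and second argument if yours differ.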

2 Answers

Answered By CloudSage42

It sounds like something may be missing from your licensing config. Make sure the ConfigMap named `licensing-config` actually contains a valid token, since the helm command points the driver at it via `driver.licensingConfig.configMapName`. If you can share the exact error message, that would help too!
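One way to check this is to read the token back out of the ConfigMap. This is a rough sketch, assuming the token is stored under the key `client_configuration_token.tok` in the `gpu-operator` namespace; the function name `check_licensing_token` is hypothetical. It only confirms the key is present and non-empty, not that the license server accepts the token.

```shell
# Sketch only: confirm the licensing ConfigMap holds a non-empty token.
# The key name and namespace defaults are assumptions; adjust to your setup.
check_licensing_token() {
  ns="${1:-gpu-operator}"
  cm="${2:-licensing-config}"
  key="${3:-client_configuration_token.tok}"

  # Dots inside a key have to be escaped in kubectl's jsonpath syntax.
  escaped_key=$(printf '%s' "$key" | sed 's/\./\\./g')

  token=$(kubectl get configmap "$cm" -n "$ns" -o "jsonpath={.data.${escaped_key}}") || return 2

  if [ -z "$token" ]; then
    echo "ConfigMap $cm in namespace $ns has no data under key $key"
    return 1
  fi
  echo "ConfigMap $cm contains a token of ${#token} characters"
}
```

If the function reports no data, `kubectl describe configmap licensing-config -n gpu-operator` shows which keys the ConfigMap actually has.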

TechieMcGadget -

I've set up the token from the NVIDIA license server in `client_configuration_token.tok`, and I believe that part is correct. But now I'm seeing this error: 'Startup probe failed: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.' Any ideas on what could be causing that?
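That probe message means `nvidia-smi` ran inside the driver container but could not reach a loaded NVIDIA kernel module, so the driver pod's own logs and events are usually the next place to look. A minimal sketch for gathering them is below. The function name `driver_diagnostics` is hypothetical, and the label `app=nvidia-driver-daemonset` is an assumption about how the operator labels its driver pods.

```shell
# Sketch only: first-pass diagnostics for the operator's driver pods.
# The label selector is an assumption; if it matches nothing, list the pods
# and pass the driver pod name to kubectl logs directly.
driver_diagnostics() {
  ns="${1:-gpu-operator}"

  echo "== Pods in $ns =="
  kubectl get pods -n "$ns" -o wide

  echo "== Recent events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp

  echo "== Driver container logs =="
  kubectl logs -n "$ns" -l app=nvidia-driver-daemonset --all-containers --tail=50
}
```

On AKS it is also worth ruling out a driver that the node image installed on its own, since as far as I know that can conflict with the driver the operator tries to manage.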

Answered By GPUGuru77

I’ve had similar issues, and I noticed that the NVIDIA GPU Operator docs sometimes show a different helm install syntax than the AI Enterprise guide. Compare your command against the operator install guide at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide to see if there’s something you need to adjust for your setup.
