Hey everyone! I'm following the NVIDIA AI Enterprise deployment guide for Azure AKS and I'm stuck at the final step. The command I'm running is: `helm install gpu-operator nvaie/gpu-operator-4-0 --version 23.6.1 --set driver.repository=nvcr.io/nvaie,driver.licensingConfig.configMapName=licensing-config --namespace gpu-operator`. It fails with errors I don't understand. If anyone has experience with this, I'd really appreciate any tips on what to check during the installation!
2 Answers
It sounds like something might be missing from the licensing config. Make sure the ConfigMap named `licensing-config` exists in the `gpu-operator` namespace and actually contains a valid client configuration token, since the helm command refers to it through `driver.licensingConfig.configMapName`. If you can share the exact error message, that would help too!
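For what it's worth, that ConfigMap is normally built from two files, `gridd.conf` and `client_configuration_token.tok`. Here is a rough pre-flight check you could run before creating it. The helper name `check_licensing_files` and the idea of keeping both files in one directory are my own assumptions, not something taken from the guide, so treat it as a sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that the two licensing files are present
# before building the ConfigMap. Directory defaults to the current one.
check_licensing_files() {
  local dir="${1:-.}"
  # gridd.conf may be empty, so only its existence is checked.
  if [ ! -f "$dir/gridd.conf" ]; then
    echo "missing: $dir/gridd.conf"
    return 1
  fi
  # The token file must exist and be non-empty.
  if [ ! -s "$dir/client_configuration_token.tok" ]; then
    echo "missing or empty: $dir/client_configuration_token.tok"
    return 1
  fi
  echo "ok"
}

# Once the check prints "ok", the ConfigMap can be created and inspected:
#   kubectl create configmap licensing-config -n gpu-operator \
#     --from-file=gridd.conf --from-file=client_configuration_token.tok
#   kubectl get configmap licensing-config -n gpu-operator -o yaml
```

Call it as `check_licensing_files /path/to/files`. The check only confirms the files are there; it cannot tell whether the token itself is still valid on the license server.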
I’ve had similar issues, and I noticed that the NVIDIA docs sometimes show a different syntax for the helm install command. You might want to check this out: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide to see if there’s something you need to adjust for your setup.
I've set up the token from the NVIDIA license server correctly in `client_configuration_token.tok`. But now I'm seeing this error: 'Startup probe failed: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.' Any ideas on what could be causing that?
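That probe message only says `nvidia-smi` could not reach the driver, so the actual reason is usually in the driver pod's events and logs. A minimal sketch for pulling those up is below. The helper name `driver_debug_cmds` is made up, and the label `app=nvidia-driver-daemonset` is an assumption about how the driver pods are labelled; the first command prints the real labels so you can confirm it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the kubectl commands for inspecting the driver pods.
# The label app=nvidia-driver-daemonset is an assumption; verify it with --show-labels.
driver_debug_cmds() {
  local ns="${1:-gpu-operator}"
  cat <<EOF
kubectl get pods -n $ns -o wide --show-labels
kubectl describe pods -n $ns -l app=nvidia-driver-daemonset
kubectl logs -n $ns -l app=nvidia-driver-daemonset --all-containers --tail=100
EOF
}

# Print the commands for the namespace used in the helm install.
driver_debug_cmds gpu-operator
```

The helper only prints the commands so they can be reviewed and pasted one at a time. One thing that may be worth checking on AKS is whether the GPU node pool already had a driver installed by AKS itself, since that could conflict with the driver container the operator deploys.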