I'm currently working as a trainee at my university's supercomputing center, and this week one of our Tesla P100 GPUs stopped responding. I've been assigned to diagnose the issue, and I'm looking for any advice or techniques that could help me figure out what's wrong. Any tips would be greatly appreciated!
4 Answers
If you're on Linux, start by checking whether the GPU shows up as a PCI device using 'lspci'. A compute card like the P100 is normally listed as a '3D controller' rather than a 'VGA compatible controller'; if it appears at all, that's a good sign the card is at least being enumerated on the bus. After that, run 'nvidia-smi' to see whether the driver detects it properly. Definitely check the power cables while you're at it. And considering its age, it might simply be time for the GPU to retire.
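If you'd rather script the 'lspci' check, here is a minimal Python sketch. It assumes 'lspci' (from pciutils) is on the PATH and looks for either the vendor name or NVIDIA's PCI vendor ID, 10de, in the output of 'lspci -nn'. The function name find_nvidia_devices is just something I made up for illustration.

```python
import shutil
import subprocess


def find_nvidia_devices(lspci_text):
    """Return the lines of `lspci -nn` output that belong to NVIDIA devices.

    A line matches if it names NVIDIA or carries NVIDIA's PCI vendor ID (10de).
    """
    matches = []
    for line in lspci_text.splitlines():
        lowered = line.lower()
        if "nvidia" in lowered or "[10de:" in lowered:
            matches.append(line.strip())
    return matches


def main():
    # lspci ships with pciutils and may be absent on minimal installs.
    if shutil.which("lspci") is None:
        print("lspci not found; install pciutils or check your PATH")
        return
    # -nn prints vendor/device codes as both names and numbers.
    result = subprocess.run(
        ["lspci", "-nn"], capture_output=True, text=True, check=False
    )
    devices = find_nvidia_devices(result.stdout)
    if devices:
        print("NVIDIA devices visible on the PCI bus:")
        for dev in devices:
            print("  " + dev)
    else:
        print("No NVIDIA device enumerated on the PCI bus")


if __name__ == "__main__":
    main()
```

If nothing is listed, the card is not being enumerated at all, which points toward seating, power, the slot, or the card itself rather than the driver.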
First of all, try installing the GPU in another machine. If it doesn't work there either, it's likely toast. Straightforward, but sometimes that’s all it takes!
Before diving deep into diagnostics, it might be worth checking if it's a network issue instead. Seriously though, when you mention 'diagnose,' what specific problems are you seeing? Is there no power, no link light, or is it just not responding at all?
My supervisor checked the cluster control and confirmed the card isn't responding to anything, like it’s completely missing. He wants me to rule out any hardware issues.
Don't forget to run 'nvidia-smi' as well! For every GPU the driver can see, it reports status such as temperature, power draw, and memory usage, so a card that's absent from its output is a finding in itself.
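For a scriptable version, 'nvidia-smi' has a CSV query mode. Below is a rough Python sketch that asks for a few fields and parses the result. It assumes the NVIDIA driver is installed and 'nvidia-smi' is on the PATH. The helper name parse_gpu_csv and the choice of fields are mine, so adjust them to whatever you need.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; the selection here is illustrative.
FIELDS = ["index", "name", "pci.bus_id", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Skip blank or malformed lines instead of guessing at their meaning.
        if len(row) != len(FIELDS):
            continue
        rows.append(dict(zip(FIELDS, (value.strip() for value in row))))
    return rows


def main():
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; is the NVIDIA driver installed?")
        return
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        # A non-zero exit is itself worth noting in your diagnosis.
        print("nvidia-smi failed:", (result.stderr or result.stdout).strip())
        return
    gpus = parse_gpu_csv(result.stdout)
    print(f"Driver reports {len(gpus)} GPU(s)")
    for gpu in gpus:
        print(gpu)


if __name__ == "__main__":
    main()
```

Comparing the GPU count and bus IDs here against what the node is supposed to have tells you quickly whether the driver sees the card at all.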

That's a solid approach, thanks for the tip!