What’s the best way to diagnose a non-responsive GPU?

Asked By TechyTurtle99

I'm currently working as a trainee at my university's super-computing center, and this week one of our Tesla P100 GPUs stopped responding. I've been assigned to diagnose the issue, and I'm looking for any advice or techniques that could help me figure out what's wrong. Any tips would be greatly appreciated!

4 Answers

Answered By GadgetGuru12

If you're on Linux, start by checking whether the GPU shows up as a PCI device with 'lspci'. A Tesla P100 has no display outputs, so it normally appears as a '3D controller' rather than a 'VGA compatible controller'; if it is listed at all, that's a good sign it's at least somewhat functional. After that, try running 'nvidia-smi' to see whether the driver detects it properly. Also, considering its age, it might be time for the GPU to retire, but definitely check the power cables while you're at it!
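For anyone who wants to script that first check across several nodes, here is a minimal Python sketch. It assumes a Linux host with 'lspci' on the PATH; the helper names find_nvidia_devices and run_lspci are made up for illustration, and matching on the word "nvidia" is just a simple heuristic, not an official detection method.

```python
import shutil
import subprocess


def find_nvidia_devices(lspci_output: str) -> list[str]:
    """Return the lspci lines that mention NVIDIA (case-insensitive match)."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if "nvidia" in line.lower()
    ]


def run_lspci() -> str:
    """Run lspci and return its stdout, or an empty string if it is unavailable."""
    if shutil.which("lspci") is None:
        return ""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    devices = find_nvidia_devices(run_lspci())
    if devices:
        print("NVIDIA devices visible on the PCI bus:")
        for device in devices:
            print("  " + device)
    else:
        # Nothing listed: the card may be unseated, unpowered, or dead,
        # or lspci itself is not installed on this host.
        print("No NVIDIA device found in lspci output.")
```

If the card is missing here, the problem sits below the driver (seating, power, riser, or the card itself), so 'nvidia-smi' will not see it either.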

Answered By ByteMe2021

First of all, try installing the GPU in another machine. If it doesn't work there either, it's likely toast. Straightforward, but sometimes that’s all it takes!

TechyTurtle99 -

That's a solid approach, thanks for the tip!

Answered By CircuitSquad101

Before diving deep into diagnostics, it might be worth checking if it's a network issue instead. Seriously though, when you mention 'diagnose,' what specific problems are you seeing? Is there no power, no link light, or just not responding at all?

TechyTurtle99 -

My supervisor checked the cluster control and confirmed the card isn't responding to anything, like it’s completely missing. He wants me to rule out any hardware issues.

Answered By PixelPioneer

Don't forget to run 'nvidia-smi' as well! It can give you some helpful information regarding the GPU's status.
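As a rough sketch of what that could look like in a script, the Python below asks 'nvidia-smi' for a few status fields in CSV form. It assumes the NVIDIA driver is installed; the field list is one possible choice, and parse_gpu_csv and query_gpus are hypothetical helper names. A timeout is included because a wedged GPU can leave 'nvidia-smi' hanging.

```python
import csv
import io
import shutil
import subprocess

# Fields passed to nvidia-smi's --query-gpu option; adjust as needed.
QUERY_FIELDS = ["index", "name", "pci.bus_id", "temperature.gpu", "power.draw"]


def parse_gpu_csv(csv_text, fields=QUERY_FIELDS):
    """Turn 'csv,noheader' output into a list of dicts, one per GPU."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for values in reader:
        if not values:
            continue  # skip blank lines
        rows.append(dict(zip(fields, (v.strip() for v in values))))
    return rows


def query_gpus(timeout_s=30):
    """Return per-GPU status dicts, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=False, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None  # nvidia-smi hung, which is itself a useful symptom
    if result.returncode != 0:
        return None
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is unavailable, failed, or timed out.")
    else:
        for gpu in gpus:
            print(gpu)
```

Comparing the number of GPUs returned against what the node is supposed to have is a quick way to confirm that one card has dropped out.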
