Has anyone else been facing issues with GPU clusters where the node appears healthy, yet jobs fail until a reboot? I've noticed that even though the node is up and all metrics look normal, including NVML and DCGM stats, the distributed training or inference jobs tend to stall, hang, or crash. It seems like something isn't quite right beneath the surface, and I'm trying to pinpoint whether specific patterns—like AER noise, Xids, ECC drift, and others—could be indicative of the node becoming unusable. I'm really interested in hearing about anyone's experiences diagnosing such problems: What were the root causes you discovered? Were there any signals that were genuinely predictive, and what turned out to be misleading or irrelevant?
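For context, this is roughly the triage I run today before giving up and rebooting. It's a minimal sketch: the grep patterns (`NVRM: Xid`, `AER:`) are just the kernel-log strings I've seen from the NVIDIA driver and the PCIe AER subsystem, and your log format may differ.

```shell
# scan_gpu_log: read kernel log lines (e.g. from `dmesg`) on stdin and tag
# the two signal families mentioned above:
#   - NVRM Xid events (GPU-side faults reported by the NVIDIA driver)
#   - PCIe AER messages (link-level errors, often "noise" but worth counting)
scan_gpu_log() {
    while IFS= read -r line; do
        case "$line" in
            *"NVRM: Xid"*) echo "XID $line" ;;
            *"AER:"*)      echo "AER $line" ;;
        esac
    done
}
```

Typical usage would be `dmesg | scan_gpu_log`; a node that looks healthy in NVML/DCGM but shows recent Xid 79 ("fallen off the bus") lines here is, in my experience, already unusable.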
3 Answers
(Vendor note) We published a technical eBook on optimizing GPU resources that might help; it's general guidance rather than a product pitch. You can check it out for insights!
I ran into something similar: silent ECC errors that only surfaced after a node restart. Separately, switching to lightweight container images like Minimus helped me isolate some odd container-related issues.
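To catch those ECC errors before a restart exposes them, something like the sketch below can be run periodically. It assumes the CSV output of `nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total --format=csv,noheader` as input; the zero threshold is just my choice, and GPUs without ECC report `[N/A]`, which is skipped.

```shell
# check_ecc: read "index, uncorrected-ECC-count" CSV lines on stdin and
# flag any GPU whose volatile uncorrected ECC counter is nonzero.
check_ecc() {
    while IFS=, read -r idx errs; do
        idx=$(echo "$idx" | tr -d ' ')
        errs=$(echo "$errs" | tr -d ' ')
        if [ "$errs" != "0" ] && [ "$errs" != "[N/A]" ]; then
            echo "GPU $idx: $errs uncorrected ECC errors"
        fi
    done
}
```

Wiring this into node-health checks (so the scheduler cordons the node instead of letting jobs land on it) is what actually stopped the mystery failures for me.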
If it's a single node failing, you may simply have a faulty GPU. Some GPUs run fine until they're put under load, which could also point to a driver issue. It would help to know whether this happens on random nodes or consistently on the same one, and whether that GPU model has any known issues worth sharing.
Thanks for the advice! Can I reach out to you for further details?

Great tip! I'd also recommend running NVIDIA's MATS/MODS GPU memory tests to rule out hardware problems.