I'm running an APC SMT1500RMI2U UPS with an AP9630 (NMC2) card in my homelab. TrueNAS, Proxmox, and pfSense all monitor it through NUT over SNMP. Lately TrueNAS has been sending persistent "Communication lost / Data stale" notifications, so I checked the logs and found that the AP9630 is cutting out completely. It stops responding to SNMP requests for almost 68 seconds at a time, then comes back without any issues. The UPS keeps supplying power during these outages, so it looks like only the management interface is failing.
To address this, I've staggered the polling intervals of my NUT clients, using prime-number intervals so they rarely hit the card at the same moment. I'm still getting these random 68-second disconnects. Has anyone run into a similar issue? Is this a known firmware glitch related to garbage collection or a memory leak, or could it be a classic hardware failure such as a failing capacitor on the AP9630? I'm wondering whether I need to update the firmware, replace the NMC, or switch to a Master/Slave NUT setup so that only a single IP polls the card. Any insights would be appreciated!
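For anyone weighing the prime-interval approach, here is a rough Python sketch of the arithmetic behind it. The interval values are made up for illustration (the actual settings are not given above), and it assumes all clients start polling at the same instant.

```python
from math import lcm


def coincidence_period(intervals):
    """Seconds between moments when every client polls at once,
    assuming they all started at the same instant."""
    return lcm(*intervals)


def poll_cycles_per_second(intervals):
    """Combined rate of poll cycles the card has to serve.
    This depends only on the intervals, not on how they line up."""
    return sum(1 / i for i in intervals)


# Hypothetical polling intervals in seconds, one per NUT client.
prime_intervals = [7, 11, 13]
equal_intervals = [10, 10, 10]

for label, intervals in (("prime", prime_intervals), ("equal", equal_intervals)):
    print(label,
          coincidence_period(intervals),
          round(poll_cycles_per_second(intervals), 3))
```

With these example numbers, prime intervals make all three clients coincide far less often, but the average load on the card is about the same as with equal intervals. Staggering helps with bursts, not with total request volume, which is one argument for having a single host poll the card.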
3 Answers
You might want to check the port speed and duplex settings on your switch. I had something similar happen where the switch's auto-negotiation picked an odd speed setting, which caused connection issues. Adjusting those settings fixed my problem. Keep in mind that if the card is old, it might have other underlying issues too, like a failing capacitor.
It’s weird that you’re getting a consistent 68-second dropout. That could hint at a firmware watchdog timer or a memory leak causing the card to reboot. I'd recommend watching the switch port with a packet capture to see whether you’re getting a full link flap or whether only the SNMP agent is going down. If the HTTP management interface drops at the same time, that points to the whole card resetting rather than just SNMP.
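One way to tell those cases apart is to probe SNMP and the web interface side by side and log what drops. Below is a minimal Python sketch, not a tested tool for this card. It assumes net-snmp's `snmpget` is installed, that the card answers SNMPv1 with a read community of `public`, and that the web UI listens on port 80; adjust all three to your setup. The function names are just for illustration. It reads `sysUpTime.0`, which counts up in hundredths of a second, so a value that goes backwards after an outage suggests the card restarted.

```python
import socket
import subprocess
import time

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0, hundredths of a second


def tcp_open(host, port=80, timeout=2.0):
    """True if a TCP connection to the web interface succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def snmp_uptime_ticks(host, community="public"):
    """Return sysUpTime in raw ticks via snmpget, or None on any failure."""
    cmd = ["snmpget", "-v", "1", "-c", community, "-t", "2", "-r", "0",
           "-Oqvt", host, SYS_UPTIME_OID]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None


def classify(snmp_ok, http_ok):
    """Name the failure pattern seen in a single probe round."""
    if snmp_ok and http_ok:
        return "ok"
    if http_ok:
        return "snmp agent only"
    if snmp_ok:
        return "http only"
    return "card or network down"


def rebooted(prev_ticks, cur_ticks):
    """True if uptime went backwards between two successful readings.
    (The counter also wraps after roughly 497 days of uptime.)"""
    return (prev_ticks is not None and cur_ticks is not None
            and cur_ticks < prev_ticks)


def watch(host, interval=5.0):
    """Probe forever and print one timestamped status line per round."""
    prev = None
    while True:
        ticks = snmp_uptime_ticks(host)
        state = classify(ticks is not None, tcp_open(host))
        note = " (uptime went backwards)" if rebooted(prev, ticks) else ""
        print(f"{time.strftime('%H:%M:%S')} {state}{note}", flush=True)
        if ticks is not None:
            prev = ticks
        time.sleep(interval)
```

Calling `watch("192.0.2.10")` (a placeholder address) from a machine on the same VLAN and leaving it running through a few dropouts should show whether HTTP goes away together with SNMP and whether the uptime counter resets afterwards.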
Good point! If it's the entire card resetting, that's a different ball game than just the SNMP crashing. That could explain the regular interval, too.
First off, check how old your firmware is. Updating it might resolve some bugs. Also, APC offers a free trial of their StruxureWare monitoring tool, which might make the issue easier to investigate.

Absolutely! Sometimes the negotiated port settings can flap between speeds and cause issues like the ones you described. It’s a pain, but worth checking!