Hey everyone! I'm in a bit of a bind and could really use your expertise. I'm the only sysadmin on a small IT team managing a Dell VxRail hyperconverged cluster with four ESXi hosts running around 50 VMs on vSphere 6.7. My vCenter Server appliance (which runs on Photon OS) and an external Platform Services Controller (PSC) are both virtual and live in the cluster. I can log into vSphere, but I see very little in the UI besides the administration tab. There's a banner saying it can't connect to the vCenter URL on port 443.
I have the admin account credentials for vSphere, and I'm able to access the root passwords for both the vCenter and PSC appliances. I've ensured shell login is enabled too. I've got snapshots from before I started trying to fix this mess.
Some initial checks showed that storage was a bit high at 95%, so I cleared out old files and brought it down to 40%. I ran 'fsck' and it reported clean volumes, so I'm not sure if there's corruption.
Most crucially, the VPXD service isn't starting. It gives a system error and prompts me to check the support bundle. I found some LDAP and certificate errors in the logs but haven't had any luck fixing them. Besides VPXD, the vCenter Server Services and Content Library Service also won't start. Our backup solution isn't functioning either, and I need to resolve this before seeking paid support. Any advice would be greatly appreciated!
4 Answers
I suggest digging into the VPXD service logs located at /var/log/vmware/vpxd/. Those logs might provide some clues as to why the service is failing. And just a heads-up, you mentioned your support contract expired, so you might want to get creative if help from VMware isn’t an option!
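If the log is too noisy to read top to bottom, a small filter can help. This is only a rough sketch: `scan_log` is a made-up helper name, the keyword list is a guess based on the errors mentioned in the question, and the file name vpxd.log is assumed from later in the thread.

```bash
#!/usr/bin/env bash
# Sketch: show the most recent log lines that mention errors, LDAP, or certificates.
# scan_log is a hypothetical helper, not a VMware tool.
# Usage: scan_log <logfile> [max_lines]
scan_log() {
    local logfile="$1"
    local max_lines="${2:-50}"   # default to the last 50 matches
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # Case-insensitive match on any of the three keywords, newest matches last.
    grep -iE 'error|ldap|certificate' "$logfile" | tail -n "$max_lines"
}

# Example (path taken from the answer above, file name assumed):
# scan_log /var/log/vmware/vpxd/vpxd.log 100
```

Start with the last few matches before the service gives up; the first failure in a startup attempt is usually more telling than the ones that follow.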
Make sure to check your certificates as well; expired or otherwise invalid certificates often stop services from starting. You can list them from the appliance shell and verify their status. Don't overlook this one!
As someone mentioned, listing your certs is a worthwhile step. You want to make sure none of them have expired. Something like this should help:
```
# List every VECS store (except the CRL store) and print each entry's alias and expiry date.
for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do
    echo "[*] Store : $store"
    /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store "$store" --text | grep -i -e "Alias" -e "Not After"
done
```
Check that every "Not After" date is still in the future!
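If there are a lot of entries, comparing dates by eye gets tedious. Here is a rough sketch of a filter that reads "Not After" lines on stdin and flags anything already expired. Assumptions: `check_not_after` is a made-up name, the lines look like `Not After : Mar 10 12:00:00 2030 GMT`, and GNU `date` is available (it relies on `date -d`), which I haven't verified on the appliance.

```bash
#!/usr/bin/env bash
# Sketch: flag expired certificates from "Not After" lines read on stdin.
# check_not_after is a hypothetical helper; it assumes GNU date (`date -d`).
check_not_after() {
    local now line when expiry status=0
    now=$(date +%s)
    while IFS= read -r line; do
        case "$line" in
            *"Not After"*)
                when="${line#*Not After}"   # keep what follows "Not After"
                when="${when#*: }"          # strip the " : " separator
                if ! expiry=$(date -d "$when" +%s 2>/dev/null); then
                    echo "UNPARSED: $when"
                    status=1
                elif [ "$expiry" -lt "$now" ]; then
                    echo "EXPIRED: $when"
                    status=1
                else
                    echo "ok: $when"
                fi
                ;;
        esac
    done
    # Non-zero if any date was expired or could not be parsed.
    return "$status"
}

# Example (store name is an assumption; use one from `vecs-cli store list`):
# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text | check_not_after
```

Lines without "Not After" are ignored, so the full `--text` output can be piped in unfiltered.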
Looks like all certs are valid until 2030, so I hope that’s not the problem.
First off, check that the time is set correctly on your appliances, as incorrect time can cause a ton of issues. Make sure they're synced with NTP, since clock drift between the appliances often leads to connectivity problems. Double-check how the timezone is configured too, it's worth a look!
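To put a number on the drift rather than comparing clocks by eye, something like the following may help. It is a minimal sketch: `clock_skew_ok` is a made-up helper, the 300-second default tolerance is my assumption rather than a documented vCenter limit, and the hostname in the example is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch: compare the local clock against a reference epoch timestamp.
# clock_skew_ok is a hypothetical helper; the 300 s default is an assumed tolerance.
# Usage: clock_skew_ok <reference_epoch_seconds> [max_skew_seconds]
clock_skew_ok() {
    local reference="$1"
    local max_skew="${2:-300}"
    local now skew
    now=$(date +%s)                 # epoch seconds, independent of timezone
    skew=$(( now - reference ))
    if [ "$skew" -lt 0 ]; then
        skew=$(( -skew ))           # absolute value
    fi
    echo "skew: ${skew}s"
    # Succeeds only when the skew is within the tolerance.
    [ "$skew" -le "$max_skew" ]
}

# Example, run on the vCenter appliance (psc.example.local is a placeholder):
# clock_skew_ok "$(ssh root@psc.example.local date +%s)"
```

Because it compares epoch seconds, the result is unaffected by the timezone setting, so a timezone mismatch and real clock drift can be told apart.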
Time appears to be correct, and I've synchronized NTP from one of the domain controllers. I tried switching the timezone to match ours, but it didn't help.
I've looked through vpxd.log and did notice LDAP-related errors and certificate issues. I even removed the LDAP configuration since it wasn't being used, but that didn’t solve anything. Unfortunately, our support has expired.