I recently added SSDs to my Proxmox + Ceph cluster and set up a new CRUSH rule to create a dedicated `ceph-ssd` pool. The rule targets `class ssd` across hosts, but since I only had two SSD OSDs and set the pool size to `3`, the PGs ended up `undersized` and `degraded`. The trouble started when I migrated a VM onto the SSD pool, which was almost full.

What surprised me is that the issue didn't stop at the SSD pool: it caused major instability across the whole cluster. A number of OSDs crashed, and `pmxcfs` and `corosync` couldn't form a quorum, leaving the HDD pools degraded and unresponsive as well.

Can anyone explain how a CRUSH rule issue in one pool can destabilize others? Is this typical behavior for Ceph, or did I overlook something?
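For context, this is roughly how the rule and pool were created. I'm reconstructing the commands from memory, and the `pg_num` value is just illustrative:

```bash
# Replicated CRUSH rule restricted to the ssd device class,
# with host as the failure domain
ceph osd crush rule create-replicated ssd-rule default host ssd

# Create the pool and point it at the new rule
ceph osd pool create ceph-ssd 128 128
ceph osd pool set ceph-ssd crush_rule ssd-rule

# Replication size 3 -- but only two SSD OSDs exist,
# so the third replica can never be placed
ceph osd pool set ceph-ssd size 3

# Mark it as an RBD pool for Proxmox
ceph osd pool application enable ceph-ssd rbd
```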
3 Answers
It sounds like the problem is that all OSD daemons share the cluster state. When an OSD daemon crashes, it drops out of every pool it's part of, causing a ripple effect throughout the cluster. So, even if your SSD pool and HDD pools are separate, their stability still hinges on the health of the OSD daemons.
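If you want to confirm that on your cluster, you can check which OSDs the PGs of each pool actually map to. These are standard Ceph commands; I'm using the `ceph-ssd` pool name from your post as the example:

```bash
# CRUSH tree with device classes, so you can see which OSDs back which rules
ceph osd tree

# Per-OSD utilization and PG counts -- a full or crashed OSD shows up here
ceph osd df tree

# List the PGs of a given pool and the OSDs (acting set) they map to
ceph pg ls-by-pool ceph-ssd

# Any recent OSD daemon crashes recorded by the crash module
ceph crash ls
```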
Exactly! You'd think they would be independent, but the way Ceph handles cluster state can still tie them together indirectly.
I've not experienced this issue before, but it could be a quirk specific to Proxmox's Ceph integration. It's odd that both `corosync` and `pmxcfs` were impacted, since they don't directly involve Ceph. What's your hardware setup? What does `ceph health detail` show when things go south? And which Ceph and Proxmox versions are you running?
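Concretely, output from something like the following would help. These are standard Ceph/Proxmox diagnostics, and the time window is just an example:

```bash
# Ceph side: overall status and any stuck/undersized PGs
ceph -s
ceph health detail

# Proxmox side: corosync quorum state
pvecm status

# Logs for corosync and pmxcfs (pve-cluster) around the incident
journalctl -u corosync --since "2 hours ago"
journalctl -u pve-cluster --since "2 hours ago"
```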
I updated my post with version details!
Could it be network saturation? With Ceph trying to recover at the same time as a VM migration, the link could have been saturated, causing communication problems between `ceph`, `pmxcfs`, and `corosync`. If you have a monitoring dashboard, check what network traffic looked like at the time of the incident.
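If monitoring confirms it, one mitigation is to throttle recovery and backfill so Ceph leaves headroom for cluster traffic. These are standard Ceph options; the values below are just conservative examples, not tuned for your hardware:

```bash
# Limit concurrent backfills and recovery ops per OSD (defaults are higher)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Add a small sleep between recovery ops to leave bandwidth for
# client and cluster traffic
ceph config set osd osd_recovery_sleep 0.1
```

Longer term, the usual recommendation is to give corosync its own dedicated link so storage traffic can't break quorum in the first place.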
But the OP mentioned the pools are separate, right? They're not supposed to affect each other directly like that.