I recently added SSDs to my Proxmox + Ceph cluster and set up a new CRUSH rule to create a dedicated `ceph-ssd` pool. The rule targets `class ssd` across hosts, but since I only had two SSD OSDs and set the pool size to `3`, the PGs ended up `undersized` and `degraded`. The trouble started when I migrated a VM onto the SSD pool, which was almost full.

What surprised me is that the issue didn't stop at the SSD pool: it caused major instability across the whole cluster. A number of OSDs crashed, and `pmxcfs` and `corosync` couldn't form a quorum, leaving the HDD pools degraded and unresponsive as well.

Can anyone explain how a CRUSH rule issue in one pool can destabilize others? Is this typical behavior for Ceph, or did I overlook something?
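For context, this is roughly how the rule and pool were created. I'm reconstructing the commands from memory, and the `pg_num` value is just illustrative:

```bash
# Replicated CRUSH rule restricted to the ssd device class,
# with host as the failure domain
ceph osd crush rule create-replicated ssd-rule default host ssd

# Create the pool and point it at the new rule
ceph osd pool create ceph-ssd 128 128
ceph osd pool set ceph-ssd crush_rule ssd-rule

# Replication size 3 -- but only two SSD OSDs exist,
# so the third replica can never be placed
ceph osd pool set ceph-ssd size 3

# Mark it as an RBD pool for Proxmox
ceph osd pool application enable ceph-ssd rbd
```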
3 Answers
It sounds like the problem is that all OSD daemons share the cluster state. When an OSD daemon crashes, it drops out of every pool it's part of, causing a ripple effect throughout the cluster. So, even if your SSD pool and HDD pools are separate, their stability still hinges on the health of the OSD daemons.
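If you want to confirm that on your cluster, you can check which OSDs the PGs of each pool actually map to. These are standard Ceph commands; I'm using the `ceph-ssd` pool name from your post as the example:

```bash
# CRUSH tree with device classes, so you can see which OSDs back which rules
ceph osd tree

# Per-OSD utilization and PG counts -- a full or crashed OSD shows up here
ceph osd df tree

# List the PGs of a given pool and the OSDs (acting set) they map to
ceph pg ls-by-pool ceph-ssd

# Any recent OSD daemon crashes recorded by the crash module
ceph crash ls
```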
Exactly! You'd think they would be independent, but the way Ceph handles cluster state can still tie them together indirectly.
I've not experienced this issue before, but it could be a quirk specific to Proxmox's Ceph integration. It's odd that both `corosync` and `pmxcfs` were impacted, since they don't directly involve Ceph. What's your hardware setup? What does `ceph health detail` show when things go south? And which Ceph and Proxmox versions are you running?
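Concretely, output from something like the following would help. These are standard Ceph/Proxmox diagnostics, and the time window is just an example:

```bash
# Ceph side: overall status and any stuck/undersized PGs
ceph -s
ceph health detail

# Proxmox side: corosync quorum state
pvecm status

# Logs for corosync and pmxcfs (pve-cluster) around the incident
journalctl -u corosync --since "2 hours ago"
journalctl -u pve-cluster --since "2 hours ago"
```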
I updated my post with version details!
Could it be network saturation? With Ceph trying to recover at the same time as a VM migration, the link could have been saturated, causing communication problems between `ceph`, `pmxcfs`, and `corosync`. If you have a monitoring dashboard, check what network traffic looked like at the time of the incident.
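If monitoring confirms it, one mitigation is to throttle recovery and backfill so Ceph leaves headroom for cluster traffic. These are standard Ceph options; the values below are just conservative examples, not tuned for your hardware:

```bash
# Limit concurrent backfills and recovery ops per OSD (defaults are higher)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# Add a small sleep between recovery ops to leave bandwidth for
# client and cluster traffic
ceph config set osd osd_recovery_sleep 0.1
```

Longer term, the usual recommendation is to give corosync its own dedicated link so storage traffic can't break quorum in the first place.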
But the OP mentioned the pools are separate, right? They're not supposed to affect each other directly like that.