I'm reaching out for some real-world advice on how to restructure two bare-metal Kubernetes clusters I've inherited. They're quite messy, and I want to stabilize our setup to make it more production-ready. Here's where we currently stand:
**Cluster 1: "Old Reliable"**
- **Age:** 3 years old and generally stable.
- **Storage:** It's running Portworx, but due to changes in their licensing, we need to migrate soon.
- **Key Services:** This cluster is home to our company's SSO (Keycloak), a Harbor Registry, and various utility services.
- **Networking:** There's a mix of HTTP and HTTPS termination.
**Cluster 2: "Wild West"**
- **The Problem:** This newer cluster has become chaotic with several worker nodes running legacy Docker Compose services outside of Kubernetes.
- **Risky Setup:** A single worker node serves as both the NFS storage provisioner and the Docker registry. If that node fails, the entire cluster goes down. This setup was outside my control until now.
- **Networking:** It's only running HTTP, with SSL termination at an external edge proxy.
**Challenges with IT:** Both clusters are behind an Nginx edge proxy managed by a different IT team, so any changes require a ticket, limiting our direct control.
Here's my plan, but I'd appreciate your thoughts:
1. **Storage Migration:** With Portworx off the table, should I go for Longhorn or Rook/Ceph? I'm concerned about learning curves and performance.
2. **Decoupling the "Master" Node:** I want to remove the registry and NFS storage from the single worker node; should I aim for dedicated storage servers or consider a distributed option like OpenEBS?
3. **Cleaning the Nodes:** How do I evict the Docker Compose services with minimal downtime? I was considering cordoning nodes, wiping them, and bringing them back clean.
4. **Streamlining Traffic:** It'd be great to simplify our interactions with the IT team on proxy changes. Should I push for a wildcard to point to an Ingress Controller and manage via CRDs?
5. **Cloud Utilization:** Lastly, I hope to move some low-security but critical workloads to the Cloud. Any insights or storage concerns regarding this?
If anyone has faced a similar hybrid node situation, I'd love to hear how you managed to get approval for a complete rebuild. Any specific tips regarding the Portworx migration would also be immensely helpful. Thanks!
3 Answers
Honestly, it seems like Kubernetes might be overkill for your situation. If the mixed workloads are causing chaos, consider simplifying things and focus on migrating away from Docker on those nodes. For the NFS storage, if you can get a reliable external provider, I'd recommend that over in-cluster storage any day. The cloud migration will need its own business case, so build that argument in the background while you fix the current problems first.
I get that you're feeling overwhelmed, and I totally sympathize! If you can, bringing in some help would be a great idea. It sounds like you're on the right path, but there are likely edge cases you haven't covered yet. Don't hesitate to ask around; collaboration often leads to better solutions, especially with setups like yours!
I appreciate that! I’m gathering insights, but operating in a hybrid situation like this can get nerve-wracking—thanks for the encouragement!
It sounds like a tough spot! For your storage migration, Longhorn is usually a solid choice for smaller setups. If you're leaning towards Rook/Ceph, just be ready for some overhead and a steeper learning curve, especially with configuration. As for those Docker Compose services, cordoning off nodes one by one and cleaning them sounds like the right approach to minimize downtime while ensuring a smoother transition. Label everything afterward and disable unnecessary access to keep things tidy for the future.
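A minimal sketch of that node-by-node approach, written as a small Python planner that only builds the `kubectl` commands and prints them by default. The node name and the label key are hypothetical placeholders. Note that `kubectl drain` only moves Kubernetes pods, so the Docker Compose services still have to be migrated or stopped by hand before the wipe.

```python
import subprocess
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, Optional[List[str]]]  # (description, command or None for manual work)


def plan_node_cleanup(node: str, labels: Dict[str, str]) -> List[Step]:
    """Build the ordered steps for cleaning one worker node."""
    steps: List[Step] = [
        # Stop new pods from being scheduled onto the node.
        ("cordon", ["kubectl", "cordon", node]),
        # Evict running pods; DaemonSet pods are skipped, emptyDir data is lost.
        ("drain", ["kubectl", "drain", node,
                   "--ignore-daemonsets", "--delete-emptydir-data"]),
        # Not automatable here: Compose services live outside Kubernetes.
        ("manual: migrate Compose services, wipe and rejoin the node", None),
        # Allow scheduling again once the node is back and healthy.
        ("uncordon", ["kubectl", "uncordon", node]),
    ]
    if labels:
        pairs = [f"{key}={value}" for key, value in sorted(labels.items())]
        steps.append(("label", ["kubectl", "label", "node", node,
                                "--overwrite", *pairs]))
    return steps


def run_plan(steps: List[Step], dry_run: bool = True) -> None:
    """Print each step; execute the commands only when dry_run is False."""
    for description, command in steps:
        if command is None:
            print(f"[manual] {description}")
        elif dry_run:
            print(f"[dry-run] {' '.join(command)}")
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "worker-1" and the label key are made-up examples.
    run_plan(plan_node_cleanup("worker-1", {"example.com/rebuilt": "true"}))
```

Running it one node at a time, and waiting for workloads to settle before moving to the next node, keeps the capacity loss to a single node at any point.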
Totally agree! It's crucial to have a plan for both storage types—fast SSD and slower HDD. I'd recommend setting up a tiered storage system with Longhorn for optimized performance.
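One way to sketch that tiering, assuming Longhorn's `diskSelector` StorageClass parameter and disks that have already been tagged `ssd` or `hdd` in Longhorn. The class names, tags, and replica count below are illustrative assumptions, not a recommendation. The script emits JSON, which `kubectl apply -f -` accepts.

```python
import json
from typing import Any, Dict


def longhorn_storage_class(name: str, disk_tag: str, replicas: int = 3) -> Dict[str, Any]:
    """Build a StorageClass manifest that pins volumes to disks with one tag."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "driver.longhorn.io",
        "allowVolumeExpansion": True,
        # StorageClass parameters must all be strings.
        "parameters": {
            "numberOfReplicas": str(replicas),
            # Only disks carrying this Longhorn disk tag are eligible.
            "diskSelector": disk_tag,
        },
    }


# Hypothetical tier names: one class per disk type.
tiers = [
    longhorn_storage_class("longhorn-fast", "ssd"),
    longhorn_storage_class("longhorn-bulk", "hdd"),
]

if __name__ == "__main__":
    # Wrap both classes in a List so they can be applied in one go.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": tiers}, indent=2))
```

Workloads then pick a tier through `storageClassName` in their PersistentVolumeClaims, for example `longhorn-fast` for databases and `longhorn-bulk` for backups.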

That's good advice! Also, push back any non-urgent tasks, like TLS certificate management, until you've sorted out the more pressing issues.