I'm rethinking my home cluster setup, which currently consists of three Raspberry Pis running a small bare-metal Talos cluster. Are there significant trade-offs in stability and performance between setups that run workloads on the control plane nodes and setups that keep workers and control plane separate? I've experienced slow recovery times after node failures, and I'm considering adding more Raspberry Pis to the cluster to see if that improves reliability. I've also thought about a configuration with two dedicated control plane Pis plus three combined worker/control plane nodes to improve fault tolerance. Most discussions I've read online focus on larger clusters and mention issues like noisy neighbors, which don't apply to my single-user scenario. Virtualization seems popular, but it feels unnecessary since Kubernetes should handle fault tolerance on its own. I'm open to any suggestions on how to build the most resilient, low-power home lab setup.
3 Answers
In my setup, I have a three-node HA configuration with control planes running in Proxmox, and I plan to add six Orange Pi CM5s for worker roles. If I were in your shoes, I'd consider picking up some used small form factor computers for your control plane nodes instead of additional Pis; they might offer better reliability.
It sounds like you're using Raspberry Pis for your Talos cluster? Which models are you on, and are you aware of the compatibility issues with Talos on the Pi 5s? I'd recommend checking that before making any purchases.
I was planning on getting the Pi 5s, but I didn't know there were issues with Talos! Thanks for the heads up!
When it comes to control planes, it's crucial to have an odd number so the cluster can always form a clear majority (quorum). Using just two control plane nodes can actually make your setup less resilient: a majority of two is two, so if either node fails, the survivor can't reach quorum on its own and the control plane stops accepting changes. Aim for at least three control plane nodes for high availability.
That makes sense! So five would be the number to aim for if you want to tolerate two node failures at once. Just remember to keep the count odd, since an even number adds a node without adding any fault tolerance!
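To make the odd-number advice concrete, here is a small sketch of the majority arithmetic. It assumes the usual majority-quorum rule used by etcd (quorum is more than half the members); the function names are just illustrative and not part of Talos or Kubernetes.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


# Compare control plane sizes from one to five nodes.
for n in range(1, 6):
    print(
        f"{n} control plane node(s): "
        f"quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)"
    )
```

Running this shows that two nodes tolerate zero failures, three and four both tolerate one, and five tolerates two, which is why going from an odd count to the next even count buys nothing.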

That's really helpful! I'm curious about your experience; what's the actual benefit of separating the control plane and workers in a three-node cluster? Do you find that recovery times are improved?