I'm working on deploying mini-data centers that are designed for heat reuse, located in challenging environments where dust, vibration, and poor connectivity are common. We're focused on Industrial IoT and edge computing. We primarily use K3s for orchestration and gather data from various sources like IT workloads and MQTT. Uptime is absolutely critical since we can't depend on perfect infrastructure. I'm considering two hardware options: a single high-end rugged server, which minimizes the physical footprint, versus a 3-node cluster made up of cheaper industrial PCs for high availability. I'm really looking for the most reliable way to run Kubernetes at the edge and would love to hear if Kubernetes is a suitable choice for our needs. Thanks for any insights!
5 Answers
I highly recommend the multi-node approach. With three control plane nodes, the cluster keeps functioning if one node goes down, because the remaining two still form a majority, although it runs in a degraded state until the node is back. That also buys you time to address problems before they turn into outages. You might also want to set up priority classes, along with taints and tolerations, to keep your critical workloads running when capacity is reduced.
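To make the "three nodes survive one failure" point concrete, here is a small sketch of the majority arithmetic, assuming the control plane uses an etcd-style (Raft) datastore such as the embedded etcd that K3s offers for HA. The function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting control plane nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many voting nodes can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(
            f"{n} control plane node(s): quorum={quorum(n)}, "
            f"tolerates {tolerated_failures(n)} failure(s)"
        )
```

Note that two nodes tolerate no more failures than one, which is why an odd number of control plane nodes (three, in this case) is the usual starting point.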
A 3-node cluster is often better for uptime, especially if you're in a rough environment. Chick-fil-A uses a similar setup for their point-of-sale systems and seems to be satisfied with it. Keeping it redundant with an HA system can really pay off in critical situations. Definitely take a look at their approach. I think you'll find valuable insights there!
That sounds like a solid reference. I'll definitely check into how they manage their systems!
I've been in a similar situation, using a single server for scientific data collection. I've had quite a few nodes go down because of networking and power issues. As a workaround, I now keep cold spares at the more remote sites. It might be worth looking into one node plus a cold spare, mainly to save on hardware costs, provided you have a robust way of detecting failures.
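As a rough illustration of what failure detection could look like on a flaky link, here is a minimal sketch, assuming each site sends a periodic heartbeat (for example over MQTT) to a central monitor. The class name, the 60-second interval, and the five-miss threshold are all hypothetical values, not something from my actual setup. The idea is to treat a node as failed only after several consecutive missed heartbeats, so a short connectivity drop does not trigger a site visit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    """Tracks heartbeats from one edge node (hypothetical helper)."""

    interval_s: float = 60.0   # assumed heartbeat period
    missed_limit: int = 5      # assumed misses before declaring failure
    last_seen: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now` (seconds)."""
        self.last_seen = now

    def status(self, now: float) -> str:
        """Classify the node based on how many heartbeats have been missed."""
        if self.last_seen is None:
            return "unknown"
        missed = int((now - self.last_seen) // self.interval_s)
        if missed == 0:
            return "healthy"
        if missed < self.missed_limit:
            return "suspect"  # likely a transient link or power blip
        return "failed"       # sustained silence: consider the cold spare


if __name__ == "__main__":
    node = NodeHealth()
    node.heartbeat(now=0.0)
    for t in (30.0, 130.0, 300.0):
        print(t, node.status(now=t))
```

In practice the timestamps would come from a clock on the monitoring side, and the thresholds would need tuning to how often the link at each site drops.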
That sounds like a reasonable solution! I'd prefer to avoid unnecessary trips to far locations as well. What strategies do you use to ensure you only go when there's a real need for a repair?
Going with a single high-end server might seem efficient at first, but it is a single point of failure: if it goes down, everything goes down. A 3-node cluster of cheaper nodes is safer and offers better redundancy. Just make sure your physical security is tight too, since equipment in harsh, hard-to-reach locations can be tampered with. Also, consider a router with WWAN as a backup link to keep the site connected.
Thanks! I'm already using LTE for failover, but I agree, finding a reliable cellular signal in remote areas is tough. Any suggestions for a more straightforward failover setup?
If your edge sites have decent connectivity, you might consider keeping the control plane in the cloud and just using worker nodes at your physical locations. That could simplify your setup while still keeping you operational.

That makes perfect sense! It’s comforting to know that losing a machine doesn’t mean total downtime.