Hey everyone! I'm working on a project during the holidays and could use some insights from those who have experience managing EKS at larger scales. My issue is that EKS and EC2 costs are still pretty hefty despite implementing common optimizations like spot instances, autoscaling, and rightsizing. I'm exploring a hybrid setup where I keep the EKS control plane in AWS but run worker nodes on cheaper infrastructure outside of AWS, like bare metal servers from providers such as Hetzner. I'd also like to use EC2 for bursts, managed through Karpenter. However, I've come across some concerns like network latency, security, and manageability. I'd love to hear if anyone has experience with production workloads on off-AWS workers while keeping the control plane in AWS. What do you think the main challenges are?
4 Answers
Hybrid setups are fairly common. At my company, we run the control plane on EC2 and use Talos Linux, which simplifies things a lot. A secure tunnel connects AWS to our on-prem workers, which saves on costs and gives us flexibility. We can burst to AWS whenever we hit a capacity limit, and it’s a great way to optimize costs while still keeping everything manageable.
Running EKS with off-AWS nodes seems risky to me. Imagine it failing in the middle of the night because of connectivity issues; that can quickly escalate costs! You might face serious problems with networking and security setup too. I'd advise running a separate lightweight cluster for batch jobs while keeping your main workloads in EKS. You'll save a lot of headaches in the long run.
You can definitely achieve a similar setup without relying solely on AWS. I have a VPS running WireGuard that gives my home server a public IP, which lets me run a k3s cluster at home. It works wonderfully for low-traffic sites! This kind of approach could provide the flexibility and cost savings you're looking for without being tied to AWS.
I've considered this too, and there are definitely some downsides. For critical applications, latency can be a real issue since response times matter a lot. While some batch processes could work on remote nodes, you'd need a robust infrastructure to handle them efficiently. If running on local machines isn't viable, why not stick with spot instances? They’re often much cheaper than on-demand, and they avoid the data-regulation compliance hurdles that can come with moving workloads off AWS. Overall, unless you have spare machines and serious cost issues, I don't see it being worth the hassle.
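To put a number on the latency concern before committing to anything, a rough probe run from a candidate off-AWS machine can help. The following is a minimal sketch, not a benchmark: it only times the TCP handshake to a host and port, which approximates one network round trip and says nothing about throughput or packet loss. The function names `tcp_connect_latency_ms` and `summarize` are made up for this example, and the host and port you pass in are whatever endpoint you care about (for instance your cluster's API server).

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=3.0):
    """Time `samples` TCP handshakes to (host, port); return milliseconds.

    Hypothetical helper for illustration only. It measures connection
    setup time, not application-level response time.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open a connection and close it right away; only the handshake is timed.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        timings.append(elapsed * 1000.0)
    return timings


def summarize(timings):
    """Return the median and worst-case latency from a list of timings."""
    return {
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

Calling `summarize(tcp_connect_latency_ms(host, 443))` from the remote machine, with `host` set to your own endpoint, gives a median and a worst case to compare against what your latency-sensitive workloads can tolerate.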
