I'm exploring whether a hybrid setup using Cluster API is feasible. We have Tenstorrent Galaxy servers equipped with GPUs for LLM inference, and I'm considering a hybrid model where the control plane runs on AWS alongside regular worker nodes for KServe and monitoring, while Cluster API with the Metal3 provider manages the Galaxy servers. Is this approach doable? I'm also curious whether the EKS Hybrid Nodes option could work for us.
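To make the Metal3 part concrete, the rough shape I have in mind for the Galaxy workers is something like the sketch below (all names, versions, and counts are placeholders, not a validated config):

```yaml
# Hypothetical MachineDeployment for the Galaxy servers, managed via the
# Metal3 infrastructure provider; every name here is a placeholder.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: galaxy-workers
  namespace: default
spec:
  clusterName: galaxy-hybrid          # hypothetical hybrid cluster name
  replicas: 4
  template:
    spec:
      clusterName: galaxy-hybrid
      version: v1.30.0                # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: galaxy-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: galaxy-workers
```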
3 Answers
You're correct in thinking that Cluster API typically focuses on a single infrastructure provider per cluster. Running the control plane in the cloud with worker nodes on-premises is feasible, but you'll need an on-prem infrastructure provider for that part of the setup. Have you decided which control plane provider you're considering? Also, is it just the one cluster you're targeting?
From what I gather, you're leaning towards Cluster API, but it has a relevant limitation here: the documentation states that managing a single cluster across different infrastructure providers is not supported. So if I'm reading your situation right, I'd say it's probably not advisable to proceed with that setup. Good luck, though!
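For reference, you can see why in the API itself: a CAPI Cluster object carries exactly one infrastructureRef, so there is nowhere to attach a second provider to the same cluster. A minimal sketch with placeholder names:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: galaxy-hybrid                  # placeholder name
  namespace: default
spec:
  controlPlaneRef:                     # one control plane provider per cluster
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: galaxy-hybrid-control-plane
  infrastructureRef:                   # exactly one infrastructure provider per cluster
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: Metal3Cluster
    name: galaxy-hybrid
```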
Thanks for the insight! That makes sense.
The architecture you're looking for isn't fully supported by CAPI, though there are ways to make parts of it work. At Sidero we moved away from CAPI and built a solution for these kinds of needs: take a look at Talos Linux and Omni. We're planning to support Tenstorrent drivers soon, but at the moment we don't have automatic resource provisioning for AWS or Metal3. Keep in mind that EKS Hybrid Nodes can really add up cost-wise. We run our production in AWS with worker nodes on bare metal, and that has worked well for us.
Yes, it's just one cluster. We are still deciding on the control plane provider. We did consider EKS hybrid nodes, but it looks like that won't meet our autoscaling needs.