I'm part of a sysadmin team at a public organization that has moved from virtual machines to Kubernetes. Our focus is high-performance computing (HPC), and some of our nodes run heavy calculations. I have some Kubernetes experience, but this is my first time working with HPC environments. We use only open-source software and operate in an air-gapped network.
I'm looking for advice, based on your experience, on provisioning clusters for workloads like this. Right now we use Puppet and Foreman to set up our bare-metal nodes. After that, we use the Kubernetes Puppet module to configure the cluster, but it's outdated and lacks key features.
We initially considered Cluster API (CAPI) for cluster lifecycle management, but ran into issues interfacing it with our infrastructure. We want to keep our infrastructure-as-code (IaC) approach, with Puppet handling the OS and user setup such as Kerberos.
I've experimented with Metal3, Ironic, and kubeadm alongside Puppet, but it turned into quite a mess. I had some success with k0s, but it felt too new for my comfort. Lately I've been looking into Rancher with RKE2 for provisioning on existing nodes, but I'm wary because of past negative experiences.
Our team has a solid Unix/Linux background, though we're a bit green on containers and orchestration. I'd appreciate any insights or recommendations you might have!
4 Answers
Totally agree. Skipping traditional configuration management in favor of something like Talos seems like the best option. It simplifies everything right from the start, especially if you're aiming for high performance.
Honestly, I think Talos is the way to go. It seems like a better approach than dealing with legacy configuration management. If a node only runs Kubernetes, there really isn’t much else on it to manage. Why complicate things?
You might want to consider moving to Talos. It’s designed to minimize the burden of managing the base OS, which improves both security and maintainability. It could also streamline your operations by reducing the access management you need on Kubernetes nodes. We have bare-metal infrastructure providers that operate similarly to CAPI without the Kubernetes constraints, which could simplify things for you.
If you have control over the nodes and the L2 network, Talos with NetBoot could be a game changer. Your nodes could boot automatically when needed. How much this pays off depends on how many clusters you're dealing with. At around 30 clusters with 20-40 nodes each (roughly 600 to 1,200 nodes in total), it could make life much easier.
