I'm helping a small research team set up a private Kubernetes cluster and would love some advice. We have a storage server with 48 TB of capacity and several V100 GPU servers, all connected over gigabit Ethernet (no InfiniBand or parallel file system). The main goal is model training, with easy access via Jupyter Notebooks.
I'm planning to deploy the cluster with k3s, using Keycloak for authentication, Harbor as the image registry, and MinIO for object storage, with per-user access control to keep users' data isolated.
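For the MinIO piece, the rough idea is one private bucket per user plus a matching IAM policy. Below is a minimal sketch with the minio Python SDK; the endpoint, credentials, and the example user are all placeholders, and the returned policy would still have to be attached to the user through MinIO's admin tooling (e.g. `mc admin`), which isn't shown here.

```python
# Sketch of the isolation idea: one private bucket per user plus an IAM
# policy scoped to that bucket. Endpoint, credentials, and the example
# user are placeholders.
import json
from minio import Minio

client = Minio(
    "minio.cluster.local:9000",     # assumed internal service address
    access_key="ADMIN_ACCESS_KEY",  # placeholder admin credentials
    secret_key="ADMIN_SECRET_KEY",
    secure=False,                   # enable TLS outside a closed lab network
)

def provision_user(user: str) -> str:
    """Create a private bucket for `user` and return the matching IAM policy."""
    bucket = f"user-{user}"
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }],
    }
    return json.dumps(policy)

print(provision_user("alice"))
```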
However, I still have some questions:
1. What would be better for job orchestration: Argo Workflows, Flyte, or something else?
2. How can I enforce user limits and prioritize jobs like we do in Slurm?
3. Is there a way to mimic the qsub submission experience so users get a simpler interface? (A rough sketch of what I'm picturing follows this list.)
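To make question 3 concrete, here's the kind of wrapper I'm picturing: a tiny `ksub` command that wraps a shell script in a Kubernetes Job via the official Python client. The image, namespace, and GPU count are placeholders; this is a sketch of the desired user experience, not something I've built.

```python
#!/usr/bin/env python3
"""ksub: wrap a shell script in a Kubernetes Job, qsub-style (sketch only)."""
import sys
import uuid

from kubernetes import client, config

def ksub(script_path: str, namespace: str = "research") -> str:
    config.load_kube_config()  # inside the cluster: config.load_incluster_config()
    with open(script_path) as f:
        script = f.read()
    name = f"job-{uuid.uuid4().hex[:8]}"
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=0,  # fail fast, like a crashed batch job
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="main",
                        # placeholder image served from the Harbor registry
                        image="harbor.cluster.local/ml/train:latest",
                        command=["bash", "-c", script],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"},  # one V100 per job
                        ),
                    )],
                ),
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace, job)
    return name

if __name__ == "__main__":
    print(ksub(sys.argv[1]))  # usage: ksub train.sh
```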
I have some Kubernetes deployment experience, but very little with running it as a shared compute cluster. Any advice would be fantastic!
1 Answer
I worked on a similar setup at the University of Turin: a multi-tenant Jupyter-Notebook-as-a-Service platform called Dossier. It worked well for us, and I recommend checking out their approach.
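For your second question specifically, plain Kubernetes ResourceQuota and PriorityClass objects already cover much of what Slurm's per-user limits and job priorities do. A minimal sketch with the official Python client, where the namespace, quota amounts, and priority value are all placeholders:

```python
# Sketch: a per-team quota and a priority class, standing in for Slurm's
# per-user limits and job priorities. Namespace, amounts, and the priority
# value are placeholders.
from kubernetes import client, config

config.load_kube_config()

# Cap the "team-a" namespace at 4 GPUs, 64 CPUs, and 256 GiB of RAM.
quota = client.V1ResourceQuota(
    api_version="v1",
    kind="ResourceQuota",
    metadata=client.V1ObjectMeta(name="team-a-quota", namespace="team-a"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.nvidia.com/gpu": "4",
        "requests.cpu": "64",
        "requests.memory": "256Gi",
    }),
)
client.CoreV1Api().create_namespaced_resource_quota("team-a", quota)

# Jobs submitted with this class can preempt lower-priority ones.
high = client.V1PriorityClass(
    api_version="scheduling.k8s.io/v1",
    kind="PriorityClass",
    metadata=client.V1ObjectMeta(name="high-priority"),
    value=1000,
    preemption_policy="PreemptLowerPriority",
    description="Urgent training runs",
)
client.SchedulingV1Api().create_priority_class(high)
```

Pods that set `priorityClassName: high-priority` will then jump the scheduling queue and can preempt lower-priority runs, which is close to the Slurm behaviour you describe.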
That sounds interesting! Do they have any documentation or detailed components listed for their setup? I'd like to understand the system architecture better.