I'm working with a small research group that needs help setting up a private cluster. We have a storage server with 48 TB of space and several V100 GPU servers linked through gigabit Ethernet. We're not using InfiniBand or a parallel file system, and the main focus will be on model training with easy access through Jupyter Notebooks.
I'm considering deploying a lightweight Kubernetes cluster using k3s. Here's what I have planned so far:
- Keycloak for authentication
- Harbor for managing images
- MinIO for object storage with policy-based user data isolation (rough policy sketch below)
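To make the isolation bullet concrete, this is roughly the per-user policy I have in mind, shown in YAML for readability (MinIO expects JSON when the policy is created with `mc admin policy`, and the shared `users` bucket with a prefix per user is just my assumed layout):

```yaml
# Sketch of a MinIO/IAM-style policy scoping each user to their own prefix.
# "${aws:username}" is a policy variable resolved per user; with OIDC logins
# through Keycloak it may need to be a jwt claim such as "${jwt:preferred_username}".
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action: ["s3:ListBucket"]
    Resource: ["arn:aws:s3:::users"]
    Condition:
      StringLike:
        s3:prefix: ["${aws:username}/*"]
  - Effect: Allow
    Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    Resource: ["arn:aws:s3:::users/${aws:username}/*"]
```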
However, I have some unresolved questions:
1. What's the best choice for job orchestration? Should I use Argo Workflows, Flyte, or something else?
2. How can I implement resource scheduling that enforces per-user limits and job priorities, similar to how Slurm does? (See the quota/priority sketch after this list.)
3. Any tips on creating an HPC-like user experience, perhaps qsub-style job submission? (See the Job-template sketch after this list.)
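For question 2, the baseline I'm aware of is a namespace per user carrying a ResourceQuota, plus PriorityClasses for preemption. A minimal sketch of what I mean (the namespace, class names, and quota numbers are all placeholders):

```yaml
# Per-user cap on requested resources; assumes a "user-alice" namespace exists
# and that the NVIDIA device plugin exposes nvidia.com/gpu on the V100 nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: user-alice
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    requests.nvidia.com/gpu: "2"
---
# A low/high pair so urgent jobs can preempt routine ones, loosely like Slurm QOS.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: true
description: Default priority for routine training jobs.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-high
value: 10000
preemptionPolicy: PreemptLowerPriority
description: Can preempt batch-low jobs.
```

As far as I can tell, this caps aggregate usage and enables preemption but provides no queueing or fair-share, which is exactly the gap I'm asking about.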
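For question 3, the rough idea I keep circling back to is a thin CLI that templates a batch Job from the user's script and pipes it to `kubectl create -f -`, which feels qsub-adjacent. A sketch of the manifest such a wrapper might emit (image, script path, and names are placeholders of my own):

```yaml
# One training job on one GPU; generateName gives qsub-like unique job IDs.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: train-
  namespace: user-alice        # hypothetical per-user namespace from above
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      priorityClassName: batch-low                    # class from the sketch above
      containers:
        - name: train
          image: harbor.example.org/ml/pytorch:latest # placeholder Harbor image
          command: ["python", "/workspace/train.py"]  # placeholder script
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: 32Gi
```

Whether to build that wrapper myself or get the same ergonomics from a workflow engine is essentially question 1 again.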
I have some experience deploying apps on Kubernetes, but none managing it as a shared compute cluster. Any advice would be greatly appreciated!
1 Answer
I faced a similar challenge at the University of Turin with a system called Dossier, a multi-tenant Jupyter-Notebooks-as-a-Service platform. You might want to look into it for inspiration!
That sounds interesting! Do you know if there's any documentation for Dossier that outlines its key components? I’d love to dive deeper.