I'm helping a small research team set up a private Kubernetes cluster and would love some advice. We have a storage server with 48 TB and several V100 GPU servers connected through gigabit Ethernet—no InfiniBand or parallel file systems. The main goal is to use this setup for model training with easy access via Jupyter Notebooks.
My preliminary plan is to deploy the cluster with k3s, using Keycloak for authentication, Harbor for image management, and MinIO for object storage, with per-user access control to isolate each user's data.
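For the authentication piece, this is roughly what I have in mind (completely untested; the Keycloak URL, realm name, and client ID are placeholders): point the k3s API server at a Keycloak realm via OIDC.

```bash
# Untested sketch: install k3s on the control node and wire the API server
# to a Keycloak realm for OIDC authentication. URLs and names are placeholders.
curl -sfL https://get.k3s.io | sh -s - server \
  --kube-apiserver-arg=oidc-issuer-url=https://keycloak.example.org/realms/research \
  --kube-apiserver-arg=oidc-client-id=kubernetes \
  --kube-apiserver-arg=oidc-username-claim=preferred_username \
  --kube-apiserver-arg=oidc-groups-claim=groups
```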
However, I still have some questions:
1. What would be better for job orchestration: Argo Workflows, Flyte, or something else?
2. How can I enforce per-user resource limits and job priorities the way we do with Slurm? (I've sketched roughly what I think the Kubernetes equivalent looks like in the first example after this list.)
3. Is there a way to mimic the qsub submission experience so users get a simpler interface? (The second sketch after this list is the kind of wrapper I'm picturing.)
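For question 2, my current understanding is that a per-team namespace with a ResourceQuota plus a couple of PriorityClasses covers the hard limits and preemption side, though I don't think plain Kubernetes does Slurm-style fair-share. The namespace name, quota numbers, and priority values below are made up for illustration.

```yaml
# Illustrative only: cap what the "team-a" namespace can consume,
# roughly like a Slurm account/QOS limit.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "4"
    pods: "20"
---
# Two priority levels that jobs can opt into via priorityClassName,
# so urgent runs can preempt long background training jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "Urgent / interactive training jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: true
description: "Default for batch training jobs"
```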
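And for question 3, this is the sort of thin qsub-style wrapper I'm picturing (untested; the script name, image, and defaults are placeholders): ship the user's script into the cluster as a ConfigMap and run it as a one-shot Job that requests GPUs.

```bash
#!/usr/bin/env bash
# ksub: untested sketch of a qsub-like wrapper around kubectl.
# Usage: ./ksub train.sh [num_gpus]
set -euo pipefail

SCRIPT="$1"
GPUS="${2:-1}"
NAME="job-$(whoami)-$(date +%s)"

# Ship the user's script into the cluster as a ConfigMap...
kubectl create configmap "${NAME}-script" --from-file=run.sh="${SCRIPT}"

# ...and run it as a one-shot Job that requests the given number of GPUs.
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${NAME}
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: harbor.example.org/research/pytorch:latest   # placeholder image
        command: ["bash", "/work/run.sh"]
        resources:
          limits:
            nvidia.com/gpu: "${GPUS}"
        volumeMounts:
        - name: script
          mountPath: /work
      volumes:
      - name: script
        configMap:
          name: ${NAME}-script
EOF

echo "Submitted ${NAME}. Follow logs with: kubectl logs -f job/${NAME}"
```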
I have some Kubernetes deployment experience, but very little with setting it up as a shared, multi-user compute cluster. Any advice would be fantastic!