As AI and ML integration ramps up in my organization, I've noticed our CI/CD pipelines are getting more complicated. It's not just about deploying apps anymore; we're faced with challenges like versioning large models (which aren't Git-friendly), monitoring model drift and performance, managing GPU resources, and ensuring security and compliance for AI services. Traditional DevOps tools seem inadequate for these ML-specific workflows, particularly regarding observability and governance. We've looked into tools like MLflow, Kubeflow, and Hugging Face Inference Endpoints, but creating a smooth, reliable pipeline feels hit or miss. So, I'm curious:
1. How are you adapting your CI/CD practices to accommodate ML workloads in production?
2. Have you found effective ways to automate monitoring and model re-training workflows with GenAI in mind?
3. What tools, patterns, or playbooks would you suggest?

Thanks for any insights!
2 Answers
At my workplace, we started with Kubeflow for our ML workflows. There are indeed better-suited tools for ML than traditional CI/CD ones, but the key is ensuring reproducibility and having a sensible process for improving models. Versioning a model means tying it to the training and test data, the codebase used, the hyper-parameters, and the performance reports, which is probably why plain Git feels like a poor fit. We use a combination of Git, Delta Lake, MLflow, and Airflow: Git for code, Delta Lake for versioning data, MLflow for logging training parameters and metrics, and Airflow for orchestration. Kubeflow can cover all of this on its own, and since it runs on Kubernetes, GPU/CPU/RAM resource management is handled there as well, which simplifies a lot of those concerns.
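To make that concrete, here's a minimal sketch of what one tracked training run looks like with that combination. The run name, Delta table path, hyper-parameters, and metric values are placeholders, not our real pipeline; the point is that a single MLflow run ends up linking the code commit, the data version, the parameters, and the results.

```
# Minimal sketch: tie code, data, hyper-parameters, and metrics to one MLflow run.
# Assumes an MLflow tracking server is configured; the tag values, params, and
# metrics below are placeholders for illustration.
import subprocess
import mlflow

def current_git_commit() -> str:
    # Record the exact code revision used for training.
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

with mlflow.start_run(run_name="churn-model-training") as run:
    # Code version
    mlflow.set_tag("git_commit", current_git_commit())
    # Data version: Delta Lake table location and version label (hypothetical)
    mlflow.set_tag("train_data", "s3://lake/churn/features@v42")
    # Hyper-parameters
    mlflow.log_params({"learning_rate": 0.05, "max_depth": 6, "n_estimators": 300})
    # ... train the model here ...
    # Performance report
    mlflow.log_metrics({"auc": 0.91, "f1": 0.84})
    # The run ID becomes the handle that connects all of the above
    print(f"Tracked run: {run.info.run_id}")
```

Airflow then just calls a script like this as one task in the DAG, so reproducing any past model is a matter of looking up its run ID.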
Honestly, I don't see much difference from traditional DevOps: treat model updates like software releases and make sure you're monitoring properly. For models that don't fit in Git, we store the artifacts in S3 buckets and reference the S3 URIs in our Git repos. Keeping model artifacts immutable and never deleting them has let us dig back into previous versions when needed. It also helps to tag telemetry data with the model version and its 'age', because user behavior can shift over time depending on which model is in use.
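Roughly, the pattern looks like the sketch below. The bucket name, key layout, and version label are made up for illustration; the URI printed at the end is what would get committed to the Git repo, and the telemetry helper shows the version/age tagging idea.

```
# Sketch of the "immutable model artifacts in S3" pattern.
# Bucket, key layout, and version label are hypothetical.
import time
import boto3

BUCKET = "ml-model-artifacts"        # hypothetical bucket
MODEL_VERSION = "2024-06-01-rev3"    # hypothetical immutable version label
KEY = f"churn-model/{MODEL_VERSION}/model.pkl"

s3 = boto3.client("s3")

# Never overwrite: fail loudly if this version already exists.
existing = s3.list_objects_v2(Bucket=BUCKET, Prefix=KEY)
if existing.get("KeyCount", 0) > 0:
    raise RuntimeError(f"{MODEL_VERSION} already exists; bump the version instead.")

s3.upload_file("model.pkl", BUCKET, KEY)
print(f"Reference this URI in Git: s3://{BUCKET}/{KEY}")

# At serving time, stamp every telemetry event with the model version and age
# so behaviour shifts can be attributed to the model that produced them.
DEPLOYED_AT = time.time()  # captured at deploy time in a real system

def tag_telemetry(event: dict) -> dict:
    event["model_version"] = MODEL_VERSION
    event["model_age_days"] = round((time.time() - DEPLOYED_AT) / 86400, 2)
    return event
```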
Do you use tools like Garak or PyRIT to scan the models for security issues during CI/CD?
Thanks for the detailed breakdown! Since you're using MLflow and DeltaLake, have you encountered issues with scaling the MLflow Tracking Server for a lot of experiments or models? We're thinking about whether to self-host or opt for a managed solution.