I've built a system that uses a margin-based approach to schedule jobs on GitLab runners: it calls Claude AI only when two runners are matched within a 15% scoring margin. My goal is to improve scheduling efficiency, since the default FIFO method isn't prioritizing production deploys effectively. Right now I have four Python agents that monitor runner performance, analyze job scores, make assignments, and optimize performance. I'm particularly interested in feedback on my trust model for production deployments: it has three automation tiers, from advisory through supervised to fully autonomous. I also have some specific questions I'd love feedback on, including whether the 15% margin is appropriate and how to measure trust in the decision-making process. I'm open to all insights, especially from those experienced with scheduling systems at scale!
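To make the margin question concrete, here is a minimal sketch of the escalation logic as I understand it from the description. The names (`score_runner`, `ask_claude`) are illustrative placeholders, not my actual code:

```python
# Sketch of margin-based escalation: score runners deterministically,
# and only call the AI when the top two are within the relative margin.
# score_runner and ask_claude are hypothetical callables for illustration.

MARGIN = 0.15  # 15% relative scoring margin

def pick_runner(job, runners, score_runner, ask_claude):
    """Return the chosen runner; escalate to the AI only on near-ties."""
    scored = sorted(((score_runner(job, r), r) for r in runners),
                    key=lambda t: t[0], reverse=True)
    (best_score, best), (second_score, _second) = scored[0], scored[1]
    # Relative margin: how far ahead the best runner is of the runner-up.
    if best_score > 0 and (best_score - second_score) / best_score >= MARGIN:
        return best  # clear winner, no AI call needed
    # Near-tie: hand the top two candidates to the AI for a rationale.
    return ask_claude(job, [scored[0][1], scored[1][1]])
```

One design question this surfaces: a relative margin behaves differently at low absolute scores than a fixed-point difference would, which may matter for the "is 15% appropriate" discussion.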
5 Answers
Have you considered using tags to select runners directly from the job definition? It seems like the root of your problem might be that you're adding extra layers instead of optimizing build and pipeline times.
You are adding another layer in front of your runners which could become a single point of failure. Have you considered how this affects your system's reliability? Plus, even with a 2-3 second scoring process, every job will wait longer.
You're right about the potential risks! I designed the system to be non-blocking: if it fails, jobs revert to GitLab's default scheduling. I agree on the latency concern, but only a small percentage of jobs will require that extra scoring time. I'm also planning to document ways to gracefully handle failures.
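Here's a minimal sketch of the fallback behavior I'm describing, assuming a hypothetical `smart_scheduler` callable and a `fifo_fallback` representing GitLab's default path. The 3-second deadline mirrors the 2-3 second scoring estimate above:

```python
# Sketch of non-blocking scheduling with graceful degradation: run the
# scoring path under a deadline, and fall back to default FIFO scheduling
# on any failure or timeout. smart_scheduler / fifo_fallback are placeholders.

import concurrent.futures

SCORING_TIMEOUT_S = 3.0  # upper bound from the 2-3 s scoring estimate

def assign(job, smart_scheduler, fifo_fallback):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(smart_scheduler, job)
    try:
        return future.result(timeout=SCORING_TIMEOUT_S)
    except Exception:
        # Any scoring error or timeout degrades to GitLab's default order.
        return fifo_fallback(job)
    finally:
        # Don't block on a stuck scorer; abandon it and move on.
        pool.shutdown(wait=False, cancel_futures=True)
```

Note that a thread already running the scorer can't be forcibly killed; in a real deployment the AI call itself should also carry its own client-side timeout.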
Interesting setup, but have you looked into the cost of running a dedicated runner for each job versus the savings from your current system? It’d be good to know the total cost of ownership in your case.
Great point! My solution is aimed at teams that run fixed fleets rather than Kubernetes. The system handles job prioritization for those fleets, much as Karpenter handles node provisioning for Kubernetes. With our approach, there's potential for significant cost savings through reduced queue wait times, plus it remains free and open source.
If you're using Kubernetes with auto-scaling, it seems like this wouldn't even be a concern. Every job I've run has basically started right away. What’s different about your setup?
While using AI for this is an interesting approach, I feel like the decisions should be more metric- and code-driven. Can you clarify what advantage Claude brings when runner scores are tight? It seems like a straightforward algorithm could solve it without AI.
The rules determine which runner is the best fit and Claude provides the rationale for the tough calls. For example, when scores are close, Claude looks at factors like a runner's recent performance and if jobs are part of the same pipeline, guiding a more informed decision than just numbers.
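For comparison, the same tie-break signals could also be expressed as a plain heuristic. This is a hypothetical sketch, not the actual system; the field names and weights are illustrative:

```python
# Sketch of a deterministic tie-break over the signals mentioned above:
# recent reliability and pipeline affinity. Weights are arbitrary examples.

def tie_break(job, candidates):
    """Among near-tied runners, prefer recent reliability and pipeline locality."""
    def bonus(runner):
        recent = runner.get("recent_success_rate", 0.0)  # 0.0 .. 1.0
        same_pipeline = job["pipeline_id"] in runner.get("active_pipelines", ())
        return 0.7 * recent + 0.3 * (1.0 if same_pipeline else 0.0)
    return max(candidates, key=bonus)
```

The AI's value over something like this would have to come from reasoning about signals that are hard to weight statically, so quantifying that difference against the heuristic baseline would strengthen the case.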

That's a valid point! GitLab does use tags to choose runners, but my challenge lies in picking which runner should actively handle the job based on its current workload, not just whether it's capable. Even with everything optimized, bottlenecks still occur when high- and low-priority jobs are queued simultaneously.
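To illustrate the distinction: tags decide which runners *can* take a job, while load decides which one *should*. A minimal sketch, with hypothetical `active_jobs` and `capacity` fields standing in for whatever runner metrics are actually available:

```python
# Sketch separating eligibility (tag matching, as GitLab does it) from
# selection (current load). Runner fields here are illustrative placeholders.

def eligible(job, runner):
    """A runner is eligible only if it covers all of the job's tags."""
    return set(job["tags"]) <= set(runner["tags"])

def least_loaded(job, runners):
    """Among tag-eligible runners, pick the one with the most free capacity."""
    candidates = [r for r in runners if eligible(job, r)]
    if not candidates:
        return None  # no capable runner; job stays queued
    return min(candidates, key=lambda r: r["active_jobs"] / r["capacity"])
```

Tags alone stop at the `eligible` step; the selection step is where workload-aware prioritization has to live.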