Seeking Feedback on My GitLab Runner Scheduling System and Trust Model

Asked By TechGuru27

I've created a system that uses a margin-based approach to schedule jobs on GitLab runners. The system calls Claude AI when two runners are equally matched, which I define as scoring within a 15% margin of each other. My goal is to improve scheduling efficiency, since the default FIFO method doesn't prioritize production deploys effectively.

Right now, I have four Python agents: one monitors runner performance, one analyzes job scores, one makes assignments, and one optimizes overall performance. I'm particularly interested in feedback on my trust model for production deployments, which has three tiers of automation: advisory, supervised, and fully autonomous.

I have some specific questions I'd love feedback on, including whether the 15% margin is appropriate and how to measure trust in the decision-making process. I'm open to all insights, especially from those experienced with scheduling systems at scale!
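To make the margin concrete, here's a simplified sketch of the tie-break check (not my exact code; the helper names and two-runner assumption are just for illustration):

```python
MARGIN = 0.15  # runners within 15% of the leader count as "equally matched"

def pick_runner(scored):
    """scored: list of (runner_name, score) pairs, higher = better fit.
    Assumes at least two candidates. Returns a runner name, or None to
    signal that the call is too close and should escalate to the AI."""
    ranked = sorted(scored, key=lambda rs: rs[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Relative gap: how far behind the runner-up is, as a fraction of the leader.
    gap = (best[1] - runner_up[1]) / best[1]
    if gap < MARGIN:
        return None  # equally matched -> escalate to the Claude tie-breaker
    return best[0]
```

So a 100-vs-90 split escalates (10% gap), while 100-vs-80 is decided by the rules alone.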

5 Answers

Answered By CI_Fanatic88

Have you considered using tags to select runners directly from the job definition? It seems like the root of your problem might be that you should be optimizing build and pipeline times instead of adding additional layers.

TechSavvySolver -

That's a valid point! GitLab does use tags to choose runners, but my challenge lies in picking which runner should actually handle the job based on current workload, not just whether a runner is capable. Even after optimizing everything, bottlenecks still appear when high- and low-priority jobs are queued simultaneously.
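To make the distinction concrete: tags answer "who *can* run this job," while the scheduler answers "who *should*." A toy example (all names and fields invented, not my actual code) that filters by tag and then picks by utilization:

```python
from dataclasses import dataclass

@dataclass
class Runner:
    name: str
    tags: set          # capabilities, e.g. {"docker", "linux"}
    active_jobs: int   # jobs currently running
    concurrency: int   # max concurrent jobs this runner accepts

def least_loaded(runners, job_tags):
    # Tags filter for capability; utilization decides among the capable.
    eligible = [r for r in runners
                if job_tags <= r.tags and r.active_jobs < r.concurrency]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.active_jobs / r.concurrency)
```

Tag matching alone would treat both eligible runners as interchangeable; the utilization pass is what FIFO never looks at.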

Answered By CautionaryDev

You are adding another layer in front of your runners which could become a single point of failure. Have you considered how this affects your system's reliability? Plus, even with a 2-3 second scoring process, every job will wait longer.

TechGuru27 -

You're right about the potential risks! I designed the system to be non-blocking: if it fails, jobs fall back to GitLab's default scheduling. I agree on the latency concern, but only the small percentage of jobs that land in the close-call margin incur the extra scoring time. I'm also planning to document how failures are handled gracefully.
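The fail-open pattern, roughly (a simplified sketch, Python 3.9+; the function names are placeholders, and the real system has more error handling):

```python
import concurrent.futures

SCORING_TIMEOUT = 3.0  # seconds; matches the 2-3 s scoring window mentioned above

def assign(job, score_fn, fifo_fallback):
    """Fail open: any scoring error or timeout reverts to default scheduling."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(score_fn, job)
    try:
        return future.result(timeout=SCORING_TIMEOUT)
    except Exception:  # timeout, API error, malformed response, ...
        return fifo_fallback(job)
    finally:
        # Don't block the scheduler waiting on a stuck scoring call.
        pool.shutdown(wait=False, cancel_futures=True)
```

The key property is that the scorer can only ever add bounded latency, never stop assignments outright.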

Answered By EconomicsOfCI

Interesting setup, but have you looked into the cost of running a dedicated runner for each job versus the savings from your current system? It’d be good to know the total cost of ownership in your case.

BudgetWiseDev -

Great point! My solution is aimed at teams that run fixed runner fleets rather than Kubernetes. It handles prioritization, a concern Karpenter only addresses at the node-provisioning level. With our approach, reduced queue wait times translate into real cost savings, and it remains free and open source.

Answered By CloudNinja201

If you're using Kubernetes with auto-scaling, it seems like this wouldn't even be a concern. Every job I've run has basically started right away. What’s different about your setup?

Answered By DevOpsDude99

While using AI for this is an interesting approach, I feel like the decisions should be more metric- and code-driven. Can you clarify what advantage Claude brings when runner scores are tight? It seems like a more straightforward algorithm could solve it without AI.

SmartScheduler42 -

The rules determine which runner is the best fit, and Claude provides the rationale for the tough calls. For example, when scores are close, Claude weighs factors like a runner's recent performance and whether jobs belong to the same pipeline, guiding a more informed decision than the raw numbers alone.
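As a rough illustration (field names are made up, not the actual schema), the rules engine could hand the model a compact context object for close calls rather than raw metrics:

```python
def tiebreak_context(job, candidates, history):
    """Build the payload for a close-call escalation.
    candidates: list of (runner_name, score) pairs inside the margin.
    history: maps runner name -> recent success rate (defaults to 1.0)."""
    return {
        "job": {
            "id": job["id"],
            "pipeline_id": job["pipeline_id"],  # lets the model co-locate pipeline jobs
            "priority": job["priority"],
        },
        "candidates": [
            {
                "runner": name,
                "score": round(score, 3),
                "recent_success_rate": history.get(name, 1.0),
            }
            for name, score in candidates
        ],
    }
```

Keeping the payload small and structured also makes the model's rationale auditable after the fact.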
