We hit a serious issue in production: a background worker managed to bypass our policy checks. Our main execution path was secured, but this worker still held provider credentials left over from an earlier prototype, so it could call the provider directly, outside our controlled path. The real damage was that some of those calls carried no `run_id` or `step_id`, the identifiers that policy enforcement and auditing depend on.
To fix this, we centralized provider credentials behind a single execution path, blocked direct access to provider endpoints, rejected any request without the required run identity, and set up alerts for calls that bypassed that path. As a result, shadow calls dropped drastically and our audit trail became reliable again. I'm curious what others are doing to prevent these bypass paths in their systems. Are you using egress controls, credential management strategies, or admission policies?
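For anyone who wants a concrete picture of the "reject requests without run identity" step, here is a minimal sketch, not our actual code. It assumes requests arrive as plain dicts; `admit`, `MissingRunIdentity`, and `REQUIRED_IDENTITY_FIELDS` are hypothetical names made up for illustration. Only `run_id` and `step_id` come from our setup.

```python
# Hypothetical admission check: names other than run_id/step_id are illustrative.
REQUIRED_IDENTITY_FIELDS = ("run_id", "step_id")


class MissingRunIdentity(Exception):
    """Raised when a request lacks the identifiers needed for policy and audit."""


def admit(request: dict) -> dict:
    """Return the request unchanged if it carries full run identity, else reject."""
    # Treat absent, None, and empty-string values all as missing.
    missing = [field for field in REQUIRED_IDENTITY_FIELDS if not request.get(field)]
    if missing:
        raise MissingRunIdentity("rejected: missing " + ", ".join(missing))
    return request
```

The point of raising rather than logging and continuing is that an unidentified call never reaches the provider, so the audit trail cannot have gaps by construction.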
3 Answers
We ran into something similar: an older background job was using hardcoded API keys to hit the provider directly, completely unmonitored. We only noticed because of unexpected cost spikes. To fix it, we introduced a lightweight proxy layer that issues short-lived, scoped tokens for every execution, so workers never hold long-lived credentials. If a call arrives without a valid token, the proxy rejects it and raises an alert. As a bonus, we get cost attribution, since each token is tied to a specific `run_id`. For us, securing egress was key; once we blocked direct provider access, the rogue calls stopped completely.
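A rough sketch of how short-lived, scoped tokens tied to a `run_id` can work, in case it helps. This is not our proxy's real code: it assumes an HMAC-signed payload with a shared secret held only by the proxy, and `SECRET`, `issue_token`, `verify_token`, and the scope string are hypothetical names. A production setup might use a standard token format instead.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # hypothetical; a real secret lives only in the proxy


def _sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(run_id, scope, ttl_seconds=60, now=None):
    """Issue a token bound to one run and one scope that expires after ttl_seconds."""
    now = time.time() if now is None else now
    claims = {"run_id": run_id, "scope": scope, "exp": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)


def verify_token(token, required_scope, now=None):
    """Return the run_id for a valid token, or None so the proxy can reject and alert."""
    now = time.time() if now is None else now
    try:
        encoded, signature = token.split(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:  # no separator, or malformed base64
        return None
    # Check the signature before trusting anything inside the payload.
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now or claims["scope"] != required_scope:
        return None
    return claims["run_id"]
```

Because `verify_token` hands back the `run_id`, the proxy can tag every provider call with it, which is where the cost attribution comes from.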
This is definitely a common issue. Prototype credentials tend to linger in production because nobody audits them. In our case, workers don't hold credentials at all; we inject them at runtime based on each worker's identity. If an old worker with an outdated config starts up, it has no credentials and can't access anything. The check then becomes whether the identity has a valid grant, rather than whether the call happened to pass through the right middleware.
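To make the "grant, not middleware" idea concrete, here is a minimal sketch under simplifying assumptions: grants and secrets are plain in-memory dicts, and `inject_credentials`, `NoGrant`, and the identity and scope strings are hypothetical names, not a real secret manager's API.

```python
class NoGrant(Exception):
    """Raised when a worker identity has no grant for the requested scope."""


def inject_credentials(worker_identity, scope, grants, secret_store):
    """Hand out a credential only if this identity holds a grant for the scope.

    grants:       dict mapping worker identity -> set of allowed scopes
    secret_store: dict mapping scope -> credential (stand-in for a secret manager)
    """
    allowed = grants.get(worker_identity, set())
    if scope not in allowed:
        # An old worker with stale config lands here: unknown identity, no grant.
        raise NoGrant(f"{worker_identity!r} has no grant for {scope!r}")
    return secret_store[scope]
```

For example, with `grants = {"worker-billing": {"provider:call"}}`, a call for `"worker-billing"` gets the credential, while a leftover `"worker-prototype"` raises `NoGrant` before it ever sees a secret.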
Absolutely! Shifting the focus to identity verification versus just hitting middleware is a much stronger solution. How do you handle credential revocation for long-running workers? Do you use a short TTL with refresh tokens, or is it an immediate revoke per call?
Honestly, these bypass issues are more common than anticipated, particularly when leftovers from prototypes stick around. Centralizing access through a single execution layer is certainly one of the best practices. I’ve also seen teams implementing identity checks and automating monitoring with tools like Runable to detect any ungated calls early on.
Totally agree. Those old credential issues can sneak up on you. For your monitoring approach, do you primarily rely on egress rules, or do you track calls that lack execution identity?

That proxy idea sounds solid! Using tokens for cost attribution is a clever hack. How did you manage token expiration? Did you go for very short life spans, or did you make them long enough to cover the entire duration of a run?