I'm a developer working on a tool for auditing and deploying AI agents, and I've hit a roadblock. Traditional continuous integration/continuous deployment (CI/CD) practices seem inadequate for AI agents, particularly for behavioral regressions caused by prompt drift or model updates. If you run large language models (LLMs) in production, how do you treat prompts? Do you treat them as configuration, like Helm charts or environment variables, or as part of the code itself? And if one of your agents starts hallucinating in production, can your current pipeline roll out a prompt version change without a complete redeployment?
5 Answers
What exactly is your AI handling? Deploying AI in production is a serious decision. I've seen some wild cases where the same prompt can yield completely different results, which is super risky! I’d urge anyone deploying AI to have a robust understanding of what they're working with before going live.
We handle prompts as separate versioned assets in a dedicated repository. We created a prompt registry that lets us change versions without modifying the main deployment; however, you still need safeguards. We discovered that even small tweaks to prompts require extensive testing, sometimes more than code changes, because they can drastically affect behavior in ways that conventional analysis tools miss. To mitigate risks, we’ve been running shadow deployments where new prompts run alongside the existing prod setup before switching completely.
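For anyone wondering what a registry plus shadow run could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: `PromptRegistry`, `shadow_run`, and the `call_model(prompt, user_input)` callable are hypothetical names, not the setup described above and not any real library's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical signature: takes (prompt_text, user_input) and returns the model's reply.
ModelFn = Callable[[str, str], str]


@dataclass
class PromptRegistry:
    """In-memory stand-in for a versioned prompt store (illustrative only)."""
    versions: Dict[str, str] = field(default_factory=dict)
    active: Optional[str] = None

    def register(self, version: str, text: str) -> None:
        # Published versions are immutable; a change means a new version.
        if version in self.versions:
            raise ValueError(f"version {version!r} already exists")
        self.versions[version] = text

    def activate(self, version: str) -> None:
        # Moving this pointer is the "switch"; agent code is not redeployed.
        if version not in self.versions:
            raise KeyError(version)
        self.active = version

    def get(self, version: Optional[str] = None) -> str:
        key = version if version is not None else self.active
        if key is None:
            raise LookupError("no active prompt version")
        return self.versions[key]


def shadow_run(registry: PromptRegistry, candidate: str,
               user_input: str, call_model: ModelFn) -> dict:
    """Serve the active prompt's reply; run the candidate only for comparison."""
    served = call_model(registry.get(), user_input)
    shadow = call_model(registry.get(candidate), user_input)
    # Exact string equality is a crude comparison; a real setup would
    # likely score both replies with an evaluator instead.
    return {"served": served, "shadow": shadow, "match": served == shadow}
```

With something like this, promoting the candidate is a single `registry.activate("v2")` call once the shadow results look acceptable, and rolling back is `registry.activate("v1")`.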
Managing prompts in a separate repo sounds plausible, but why separate from the main agent codebase? Isn’t that a hassle when developing agents? Also, do you have regression tests in place? I think if you don't, shadow deployments won't help much.
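To make the regression-test point concrete, here is a minimal sketch of a prompt regression gate in Python. It assumes a hypothetical `call_model(prompt, user_input)` callable and simple substring checks; `Case`, `pass_rate`, and `gate` are illustrative names, and a real suite would probably use richer scoring than string matching.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical signature: takes (prompt_text, user_input) and returns the model's reply.
ModelFn = Callable[[str, str], str]


@dataclass(frozen=True)
class Case:
    """One regression case: an input plus phrases the reply must or must not contain."""
    user_input: str
    must_contain: Sequence[str] = ()
    must_not_contain: Sequence[str] = ()


def passes(case: Case, output: str) -> bool:
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def pass_rate(prompt: str, cases: Sequence[Case],
              call_model: ModelFn, samples: int = 3) -> float:
    """Sample each case several times, since identical prompts can give different replies."""
    total = 0
    passed = 0
    for case in cases:
        for _ in range(samples):
            total += 1
            if passes(case, call_model(prompt, case.user_input)):
                passed += 1
    return passed / total if total else 0.0


def gate(prompt: str, cases: Sequence[Case],
         call_model: ModelFn, threshold: float = 0.95) -> bool:
    """Return True only if the prompt clears the pass-rate threshold."""
    return pass_rate(prompt, cases, call_model) >= threshold
```

Running `gate` against the candidate prompt in CI, before any shadow deployment, gives the shadow run something to confirm rather than discover.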
It's bold to deploy AI agents in production. Anything AI touches should go through a controlled server or be executed by reliable scripts. AI agents are great for testing robustness, but I use pre-commit hooks so that changes don't reach CI/CD too hastily. The mantra is to fail fast and safely.
Good points, but OP isn’t asking about servers or Git hooks. Are you suggesting they should've used a controlled server just for prompts or something similar?
I think prompts should definitely be treated as code. If your AI starts hallucinating, switching prompts on the fly isn’t usually the way to go unless you're just patching things up. You need a solid framework in place instead of making quick changes that can introduce even more issues with generative AI.
You should absolutely treat prompts as configuration files in production. Keeping them in a registry lets your agents access them easily. As for swapping prompts on the fly: ideally, you shouldn't need that in production. You should identify prompt drift that causes behavioral shifts well before any deployment. Is the agent using its tools in the correct order? Catch these issues early in testing, using effective evaluation frameworks and proper metrics tracking. If a hot swap is unavoidable, consider switching the whole agent instead, as in a blue-green deployment.
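As a rough illustration of switching the whole agent rather than just the prompt, here is a small Python sketch. The names `AgentBundle` and `BlueGreenRouter` are hypothetical, and the `evals_passed` flag stands in for whatever evaluation results your pipeline produces.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AgentBundle:
    """Everything that defines agent behavior, versioned and shipped as one unit."""
    model: str
    prompt_version: str
    tools: Tuple[str, ...]


class BlueGreenRouter:
    """Keeps two slots; only one is live, the other holds the staged bundle."""

    def __init__(self, blue: AgentBundle) -> None:
        self.slots: Dict[str, AgentBundle] = {"blue": blue}
        self.live = "blue"

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def stage(self, bundle: AgentBundle) -> str:
        # Deploy into the idle slot; live traffic is unaffected.
        slot = self.idle()
        self.slots[slot] = bundle
        return slot

    def cut_over(self, evals_passed: bool) -> None:
        slot = self.idle()
        if slot not in self.slots:
            raise RuntimeError("nothing staged in the idle slot")
        if not evals_passed:
            raise RuntimeError("refusing to cut over: evaluations did not pass")
        self.live = slot

    def rollback(self) -> None:
        # The previous bundle is still in the other slot, so rollback is a pointer flip.
        slot = self.idle()
        if slot not in self.slots:
            raise RuntimeError("no previous bundle to roll back to")
        self.live = slot

    def current(self) -> AgentBundle:
        return self.slots[self.live]
```

Bundling model, prompt version, and tool list together means a rollback restores the exact combination that was previously tested, not just the old prompt text.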

If you’re pushing back, make sure it's about fixing the AI's architecture rather than entirely blocking AI deployment. Variance in responses for identical prompts suggests the AI’s foundational instructions are shaky. You need strict controls and tuned configurations to prevent that.
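One way to put a number on that variance before deciding what to tighten is to repeat the same call and measure how often the replies agree. This is only a sketch: `agreement_rate` and the `call_model(prompt, user_input)` callable are hypothetical names, and exact-match comparison after normalization is a deliberately crude assumption.

```python
from collections import Counter
from typing import Callable

# Hypothetical signature: takes (prompt_text, user_input) and returns the model's reply.
ModelFn = Callable[[str, str], str]


def agreement_rate(prompt: str, user_input: str,
                   call_model: ModelFn, samples: int = 5) -> float:
    """Fraction of sampled replies that equal the most common reply."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    # Normalize case and surrounding whitespace so trivial differences don't count.
    outputs = [call_model(prompt, user_input).strip().lower() for _ in range(samples)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / samples
```

A low rate on inputs that should have one right answer is a signal to look at sampling settings and instructions before blaming the deployment pipeline.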