I'm new here and looking for insights into how teams are currently testing AI agents before they go live. What does your testing pipeline usually include? Are you following practices like CI-gated tests, prompt mutation or fuzzing, manual QA, or just hoping for the best? I'm trying to understand how reliability testing fits into real engineering workflows so I can avoid over-engineering a solution that might not be necessary. Also, I'm involved with Flakestorm, an open-source project focused on agent stress testing, and I'd love to hear some real-world experiences.
5 Answers
It seems like most teams are basically operating on a 'ship and pray' model, with a dash of 'users found a bug' added in for flavor. If you're fortunate enough to find a team doing CI-gated testing, you might be looking at a well-funded startup or a heavily regulated financial company. Honestly, testing AI agents is still pretty murky since the outputs can be unpredictable, but your tool could definitely help tackle that issue.
Testing AI agents is tricky, especially since they can hallucinate a meaningful fraction of the time. It's key to recreate the environment from scratch for every test run; just updating and restarting an existing one tends to cause drift. Getting reliable results is still largely trial and error, so your approach could really stand out.
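To show what I mean by rebuilding the environment per test, here's a minimal sketch using a pytest fixture. `DummyAgent` is just a stand-in for whatever wrapper you have around your real agent, so the test runs without a model behind it:

```python
import pytest

class DummyAgent:
    """Stand-in for a real agent client; swap in your own wrapper."""
    def __init__(self, state_dir):
        self.state_dir = state_dir

    def run(self, prompt: str) -> str:
        # Placeholder response so the test is runnable without a real model.
        return "Paris is the capital of France."

@pytest.fixture
def fresh_agent(tmp_path):
    # Build the environment from scratch for every test: new state directory,
    # new agent instance, nothing carried over from a previous run.
    state_dir = tmp_path / "agent_state"
    state_dir.mkdir()
    return DummyAgent(state_dir=state_dir)

def test_answers_known_question(fresh_agent):
    answer = fresh_agent.run("What is the capital of France?")
    assert "Paris" in answer
```

Because each test gets its own temporary state directory, there's nothing for drift to accumulate in between runs.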
Most teams don’t really do anything formal for testing. However, some are using 'evals,' which are basically integration tests for AI workflows. While it’s definitely tougher in a non-deterministic environment, there are still ways to verify results. For example, you can prompt the model for specific answers and validate them directly. Sometimes people use another LLM to evaluate responses as 'right' or 'wrong.' It's a developing area, but manageable with the right techniques.
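To make that concrete, here's a minimal sketch of both styles. `call_model` is a hypothetical wrapper around whatever client you're using, and the prompts are just illustrations:

```python
def call_model(prompt: str) -> str:
    """Placeholder: replace with your actual model or agent call."""
    raise NotImplementedError

def eval_direct_check() -> bool:
    # Direct validation: ask for a tightly constrained answer and compare it literally.
    reply = call_model("Reply with only the number: what is 12 * 12?")
    return reply.strip() == "144"

def eval_llm_judge(question: str, reply: str) -> bool:
    # LLM-as-judge: a second call grades the first answer as PASS or FAIL.
    verdict = call_model(
        f"Question: {question}\nAnswer: {reply}\n"
        "Is the answer factually correct? Reply with exactly PASS or FAIL."
    )
    return verdict.strip().upper() == "PASS"
```

In practice you'd run a batch of these in CI and track the pass rate rather than requiring 100%, since the outputs are non-deterministic.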
Thanks for this tip! I'll check out the concept of evals—sounds helpful.
Some teams work with agents built on models they train themselves, which gives them metrics to measure against and makes it easier to verify results. For off-the-shelf LLMs, though, I'd suggest just chatting with the agent for a while to make sure its responses hold up. You get a feel for how well it understands your use case before putting it into broader use.
In my case, it's definitely more trial and error. It's frustrating, but I'm hoping a systematic solution comes along.

Yeah, I totally agree. One small change in a prompt can lead to dramatically different outputs, making it quite frustrating. Finding a solid testing method is crucial.
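That sensitivity is exactly what prompt mutation testing tries to catch: take a known-good prompt, apply small meaning-preserving tweaks, and check the agent still lands on an acceptable answer. A rough sketch, with `call_agent` as a placeholder for the real agent call:

```python
def call_agent(prompt: str) -> str:
    """Placeholder: replace with your real agent invocation."""
    return "Paris"

def mutate(prompt: str) -> list[str]:
    # A few cheap, meaning-preserving perturbations; real fuzzing goes much further.
    return [
        prompt.lower(),
        prompt.upper(),
        prompt + "  ",                          # trailing whitespace
        prompt.replace("What is", "What's"),
    ]

def test_prompt_robustness():
    base = "What is the capital of France?"
    for variant in [base, *mutate(base)]:
        answer = call_agent(variant)
        assert "paris" in answer.lower(), f"failed on variant: {variant!r}"
```

If any variant breaks the assertion, you've found exactly the kind of brittleness you're describing, before a user does.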