September 11, 2025
This week, we hosted a live workshop on how simulation and evaluation can transform fragile prototypes into production-ready AI support agents. Brooke Hopkins, founder and CEO of Coval (YC S24), shared lessons from her time at Waymo and her work building Coval’s simulation and evaluation platform for conversational AI.
Brooke explained that the idea for Coval came directly from her experience leading evaluation infrastructure at Waymo. Many of the reliability problems facing conversational AI today mirror those solved in self-driving: how to achieve high reliability in non-deterministic systems. The key is building infrastructure that allows agents to be both autonomous and dependable, much like cloud computing achieves “six nines” of reliability despite unreliable hardware and networks.
Autonomous driving and conversational AI share a common architecture: multiple models chained together to produce a working system. In driving, perception, prediction, and planning models must align. In voice AI, speech-to-text, turn detection, LLMs, and speech synthesis work together. Both domains require testing probabilistic outputs at scale, measuring reliability across thousands of interactions rather than fixed outputs.
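The chained architecture described above can be sketched as a simple pipeline, where each stage is a separate model and an error in one stage cascades into the next. The stage functions below are illustrative stand-ins, not real model calls or any Coval API:

```python
# Hypothetical sketch of a voice-agent pipeline: each stage is a separate
# model, so a mistake early in the chain cascades downstream.

def voice_pipeline(audio, stt, detect_turn_end, llm, tts):
    text = stt(audio)                 # speech-to-text
    if not detect_turn_end(text):     # turn detection
        return None                   # speaker not done; keep listening
    reply = llm(text)                 # language model produces the response
    return tts(reply)                 # speech synthesis renders it as audio

# Toy stand-ins just to show the chaining.
out = voice_pipeline(
    audio=b"...",
    stt=lambda a: "where is my order",
    detect_turn_end=lambda t: t.endswith("order"),
    llm=lambda t: "Let me check that for you.",
    tts=lambda r: f"<audio:{r}>",
)
print(out)  # <audio:Let me check that for you.>
```

Because the output of each stage feeds the next, testing any single model in isolation misses the cascading failures that matter in production.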
Evaluating AI in support comes with two categories of risk, which leads to a best practice: test for both expected success and intentional failure.
Coval’s approach includes replaying production conversations to test new system versions. Teams can ask: if the same customer reached out today, how would the new agent respond?
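A minimal sketch of that replay idea: re-run logged customer turns against a new agent version and collect the responses that changed. The data shape, the `new_agent` callable, and the exact-match comparison are all assumptions for illustration, not Coval's implementation:

```python
# Hypothetical replay harness: feed logged customer turns to a new agent
# version and record where its replies diverge from the logged ones.

def replay(conversations, new_agent):
    """Return a list of diffs between logged replies and the new agent's."""
    diffs = []
    for convo in conversations:
        for turn in convo["turns"]:
            if turn["role"] != "customer":
                continue
            new_reply = new_agent(turn["text"])
            if new_reply != turn["agent_reply"]:
                diffs.append({
                    "customer": turn["text"],
                    "old": turn["agent_reply"],
                    "new": new_reply,
                })
    return diffs

# Usage: one toy logged conversation and a trivially different "new" agent.
logged = [{"turns": [
    {"role": "customer", "text": "Where is my order?",
     "agent_reply": "Let me check."},
]}]
diffs = replay(logged, new_agent=lambda text: "Checking your order now.")
print(diffs)
```

In practice the exact-match comparison would be replaced with a semantic check (an LLM judge or rubric), since probabilistic systems rarely reproduce identical wording.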
For earlier stages without production data, scenario prompts can generate thousands of diverse conversations, leveraging non-determinism in tone, phrasing, and flow to stress-test systems.
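One way to picture that scenario-generation step: expand a few templates across dimensions like tone, phrasing, and flow to produce many distinct test conversations. The dimensions and template below are illustrative assumptions, not Coval's prompt format:

```python
# Hypothetical scenario generator: combine tone, opening phrasing, and
# conversational flow to stress-test an agent without production data.
import itertools

tones = ["polite", "frustrated", "terse"]
phrasings = ["Where is my order?", "My package never arrived.", "Order status?"]
flows = ["asks once", "interrupts mid-answer", "changes topic"]

def generate_scenarios():
    for tone, phrasing, flow in itertools.product(tones, phrasings, flows):
        yield {
            "prompt": (f"You are a {tone} customer. Open with: '{phrasing}'. "
                       f"Then: {flow}.")
        }

scenarios = list(generate_scenarios())
print(len(scenarios))  # 3 x 3 x 3 = 27 distinct scenario prompts
```

Each generated prompt would drive a simulated customer, and the non-determinism of the LLM playing that customer multiplies the diversity further.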
Support teams often discover that fixing one issue breaks another. Brooke stressed that preventing regressions is more important than first-time fixes, since customers lose trust quickly when issues reappear. Coval addresses this by treating customer-reported issues as ongoing test cases. Every prompt or workflow change must pass those tests before deployment, ensuring reliability scales with growth.
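The regression-prevention idea above can be sketched as a deploy gate: every customer-reported issue becomes a permanent test case, and a change ships only if all of them still pass. The suite entries and `check` predicates are hypothetical examples, not Coval's test format:

```python
# Hypothetical regression gate: customer-reported issues live on as test
# cases, and a deploy is blocked unless every one of them passes.

regression_suite = [
    {"name": "refund-intent", "input": "I want a refund",
     "check": lambda reply: "refund" in reply.lower()},
    {"name": "greeting",      "input": "hi",
     "check": lambda reply: len(reply) > 0},
]

def gate_deploy(agent):
    """Run the full regression suite; raise if any case fails."""
    failures = [case["name"] for case in regression_suite
                if not case["check"](agent(case["input"]))]
    if failures:
        raise RuntimeError(f"Deploy blocked; regressions: {failures}")
    return True

# A toy agent that echoes the request, passing both cases.
print(gate_deploy(lambda text: f"Sure, let me help with: {text}"))  # True
```

Because the suite only grows, reliability compounds: a bug fixed once stays fixed, which is exactly the trust property customers notice.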
Evaluating chat agents is simpler than evaluating voice agents. Voice adds stages such as transcription and synthesis, each a cascading point of failure: mishearing a single word can derail an entire conversation. Timing sensitivity also matters, since a one-second delay is acceptable in chat but disruptive in voice. Evaluations must account for these stricter performance requirements and the higher complexity of audio data.
As AI agents expand beyond text and voice, evaluation will have to keep pace. Multimodal systems that combine text, voice, images, and video will introduce richer signals and new challenges. Just as voice went from “awkward” to increasingly natural, video interactions will mature the same way, demanding new evaluation frameworks that measure reliability across more complex contexts.
A big thank you to Brooke for the insights, and to everyone who joined us live.