September 11, 2025
This week, we hosted a live workshop on how simulation and evaluation can transform fragile prototypes into production-ready AI support agents. Brooke Hopkins, founder and CEO of Coval (YC S24), shared lessons from her time at Waymo and her work building Coval’s simulation and evaluation platform for conversational AI.
Brooke explained that the idea for Coval came directly from her experience leading evaluation infrastructure at Waymo. Many of the reliability problems facing conversational AI today mirror those solved in self-driving: how to achieve high reliability in non-deterministic systems. The key is building infrastructure that allows agents to be both autonomous and dependable, much like cloud computing achieves “six nines” of reliability despite unreliable hardware and networks.
Autonomous driving and conversational AI share a common architecture: multiple models chained together to produce a working system. In driving, perception, prediction, and planning models must align. In voice AI, speech-to-text, turn detection, LLMs, and speech synthesis work together. Both domains require testing probabilistic outputs at scale, measuring reliability across thousands of interactions rather than fixed outputs.
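The chained architecture described above can be sketched as a simple pipeline, where each stage is a separate model and an error in one stage cascades into the next. The stage functions below are illustrative stand-ins, not real model calls or any Coval API:

```python
# Hypothetical sketch of a voice-agent pipeline: each stage is a separate
# model, so a mistake early in the chain cascades downstream.

def voice_pipeline(audio, stt, detect_turn_end, llm, tts):
    text = stt(audio)                 # speech-to-text
    if not detect_turn_end(text):     # turn detection
        return None                   # speaker not done; keep listening
    reply = llm(text)                 # language model produces the response
    return tts(reply)                 # speech synthesis renders it as audio

# Toy stand-ins just to show the chaining.
out = voice_pipeline(
    audio=b"...",
    stt=lambda a: "where is my order",
    detect_turn_end=lambda t: t.endswith("order"),
    llm=lambda t: "Let me check that for you.",
    tts=lambda r: f"<audio:{r}>",
)
print(out)  # <audio:Let me check that for you.>
```

Because the output of each stage feeds the next, testing any single model in isolation misses the cascading failures that matter in production.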
Evaluating AI in support comes with two categories of risk, which leads to a best practice: test for both expected success and intentional failure.
Coval’s approach includes replaying production conversations to test new system versions. Teams can ask: if the same customer reached out today, how would the new agent respond?
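A minimal sketch of that replay idea: re-run logged customer turns against a new agent version and collect the responses that changed. The data shape, the `new_agent` callable, and the exact-match comparison are all assumptions for illustration, not Coval's implementation:

```python
# Hypothetical replay harness: feed logged customer turns to a new agent
# version and record where its replies diverge from the logged ones.

def replay(conversations, new_agent):
    """Return a list of diffs between logged replies and the new agent's."""
    diffs = []
    for convo in conversations:
        for turn in convo["turns"]:
            if turn["role"] != "customer":
                continue
            new_reply = new_agent(turn["text"])
            if new_reply != turn["agent_reply"]:
                diffs.append({
                    "customer": turn["text"],
                    "old": turn["agent_reply"],
                    "new": new_reply,
                })
    return diffs

# Usage: one toy logged conversation and a trivially different "new" agent.
logged = [{"turns": [
    {"role": "customer", "text": "Where is my order?",
     "agent_reply": "Let me check."},
]}]
diffs = replay(logged, new_agent=lambda text: "Checking your order now.")
print(diffs)
```

In practice the exact-match comparison would be replaced with a semantic check (an LLM judge or rubric), since probabilistic systems rarely reproduce identical wording.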
For earlier stages without production data, scenario prompts can generate thousands of diverse conversations, leveraging non-determinism in tone, phrasing, and flow to stress-test systems.
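One way to picture that scenario-generation step: expand a few templates across dimensions like tone, phrasing, and flow to produce many distinct test conversations. The dimensions and template below are illustrative assumptions, not Coval's prompt format:

```python
# Hypothetical scenario generator: combine tone, opening phrasing, and
# conversational flow to stress-test an agent without production data.
import itertools

tones = ["polite", "frustrated", "terse"]
phrasings = ["Where is my order?", "My package never arrived.", "Order status?"]
flows = ["asks once", "interrupts mid-answer", "changes topic"]

def generate_scenarios():
    for tone, phrasing, flow in itertools.product(tones, phrasings, flows):
        yield {
            "prompt": (f"You are a {tone} customer. Open with: '{phrasing}'. "
                       f"Then: {flow}.")
        }

scenarios = list(generate_scenarios())
print(len(scenarios))  # 3 x 3 x 3 = 27 distinct scenario prompts
```

Each generated prompt would drive a simulated customer, and the non-determinism of the LLM playing that customer multiplies the diversity further.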
Support teams often discover that fixing one issue breaks another. Brooke stressed that preventing regressions is more important than first-time fixes, since customers lose trust quickly when issues reappear. Coval addresses this by treating customer-reported issues as ongoing test cases. Every prompt or workflow change must pass those tests before deployment, ensuring reliability scales with growth.
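The regression-prevention idea above can be sketched as a deploy gate: every customer-reported issue becomes a permanent test case, and a change ships only if all of them still pass. The suite entries and `check` predicates are hypothetical examples, not Coval's test format:

```python
# Hypothetical regression gate: customer-reported issues live on as test
# cases, and a deploy is blocked unless every one of them passes.

regression_suite = [
    {"name": "refund-intent", "input": "I want a refund",
     "check": lambda reply: "refund" in reply.lower()},
    {"name": "greeting",      "input": "hi",
     "check": lambda reply: len(reply) > 0},
]

def gate_deploy(agent):
    """Run the full regression suite; raise if any case fails."""
    failures = [case["name"] for case in regression_suite
                if not case["check"](agent(case["input"]))]
    if failures:
        raise RuntimeError(f"Deploy blocked; regressions: {failures}")
    return True

# A toy agent that echoes the request, passing both cases.
print(gate_deploy(lambda text: f"Sure, let me help with: {text}"))  # True
```

Because the suite only grows, reliability compounds: a bug fixed once stays fixed, which is exactly the trust property customers notice.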
Evaluating chat agents is simpler than evaluating voice agents. Voice adds stages such as transcription and synthesis, each a cascading point of failure: mishearing a single word can derail an entire conversation. Timing sensitivity also matters, since a one-second delay is acceptable in chat but disruptive in voice. Evaluations must account for these stricter performance requirements and the higher complexity of audio data.
As AI agents expand beyond text and voice, evaluation will have to keep pace. Multimodal systems that combine text, voice, images, and video will introduce richer signals and new challenges. Just as voice went from “awkward” to increasingly natural, video interactions will mature the same way, demanding new evaluation frameworks that measure reliability across more complex contexts.
A big thank you to Brooke for the insights, and to everyone who joined us live.