August 22, 2025

Prompt like a pro: Recap of our prompting workshop with Zenbase
This week, we hosted a live workshop with Cyrus Nouroozi, founder of Zenbase (YC S24), on the realities of prompt engineering for AI customer support. The session dug into what actually works when designing prompts that scale, from fundamentals to debugging strategies and future directions for the field.


Breaking prompt bottlenecks

Many teams hit the same wall: solving one prompt issue only to break something else. Support conversations tend to follow a predictable distribution:

  • 50–80% are common patterns that can be automated
  • 10–20% are long-tail, complex cases best handled by humans

The goal is not to automate everything, but to free human agents to focus on high-value, frustrating edge cases.


Systematic error analysis

To move beyond “prompt hell,” Cyrus emphasized the importance of error analysis:

  • Collect end-to-end traces
  • Build simple labeling UIs to mark good/bad outputs
  • Use LLMs to identify error patterns (e.g. missing context, lack of empathy)
  • Adjust prompts based on those patterns

Human-labeled examples then become few-shot training data for evaluation loops.
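The error-analysis loop above can be sketched in a few lines. This is a minimal illustration, not Zenbase's tooling: the trace fields (`label`, `reason`) and both helper functions are hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical trace records: each holds a model output plus a human label
# and, for failures, a short reason tag from the labeling UI.
traces = [
    {"id": 1, "output": "…", "label": "bad", "reason": "missing context"},
    {"id": 2, "output": "…", "label": "good", "reason": None},
    {"id": 3, "output": "…", "label": "bad", "reason": "lack of empathy"},
    {"id": 4, "output": "…", "label": "bad", "reason": "missing context"},
]

def error_patterns(traces):
    """Count failure reasons across labeled traces to surface recurring patterns."""
    return Counter(t["reason"] for t in traces if t["label"] == "bad")

def few_shot_examples(traces, n=2):
    """Reuse human-labeled good outputs as few-shot examples for eval loops."""
    return [t for t in traces if t["label"] == "good"][:n]

print(error_patterns(traces).most_common(1))
```

In practice the `reason` tags would come from an LLM pass over the bad traces; the point is that labels feed two places at once: pattern detection and few-shot data.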


Iterative refinement strategies

Prompting is never one-and-done. Key practices include:

  • Collect failed cases via customer feedback and add to eval sets
  • Maintain regression test sets that must always pass
  • Monitor escalation rates as an early warning of overfitting

The cycle is simple but essential: try, fail, label, adjust.
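A regression test set can be as plain as a list of cases that must pass after every prompt change. The sketch below assumes a stand-in `fake_agent` in place of a real LLM call; the case format and function names are illustrative, not from the workshop.

```python
# Hypothetical regression suite: inputs that must always be handled correctly,
# re-run after any prompt edit to catch regressions early.
REGRESSION_CASES = [
    {"input": "Where is my order?", "must_contain": "order"},
    {"input": "Cancel my subscription", "must_contain": "cancel"},
]

def fake_agent(prompt, user_message):
    # Stand-in for a real LLM call; echoes the request for illustration.
    return f"Sure, let me help with: {user_message.lower()}"

def run_regression(prompt):
    """Return the inputs whose replies failed the check; empty list means pass."""
    failures = []
    for case in REGRESSION_CASES:
        reply = fake_agent(prompt, case["input"])
        if case["must_contain"] not in reply:
            failures.append(case["input"])
    return failures
```

Failed cases harvested from customer feedback get appended to `REGRESSION_CASES`, closing the try-fail-label-adjust loop.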


XML tagging and model-specific techniques

One surprising insight: XML can outperform Markdown in certain cases where clear section boundaries are critical. Explicit <tags> create unambiguous boundaries for LLMs, making it easier to group related instructions or constraints.

Cyrus found that XML structure was especially effective for tasks requiring strict sectioning, such as “do/don’t” lists, where the model must clearly recognize where one instruction ends and another begins. In those cases, XML structure provided enough clarity that GPT-4o mini outperformed GPT-4o on the same task.

Zenbase has even open-sourced a library to make XML prompting easier: you compose dictionaries, objects, or arrays, and it auto-converts them into structured XML.
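To make the idea concrete, here is a minimal sketch of dict-to-XML conversion. It does not reproduce Zenbase's library or its API; `to_xml` is a hypothetical helper showing the general technique.

```python
def to_xml(value, tag="prompt"):
    """Recursively convert dicts/lists/scalars into XML-tagged prompt text."""
    if isinstance(value, dict):
        inner = "".join(to_xml(v, k) for k, v in value.items())
    elif isinstance(value, (list, tuple)):
        inner = "".join(to_xml(v, "item") for v in value)
    else:
        inner = str(value)
    return f"<{tag}>{inner}</{tag}>"

rules = {"do": ["Greet the customer"], "dont": ["Promise refunds"]}
print(to_xml(rules, "rules"))
# <rules><do><item>Greet the customer</item></do><dont><item>Promise refunds</item></dont></rules>
```

The explicit tags give the model unambiguous boundaries between the “do” and “don’t” sections, which is exactly the sectioning benefit described above.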


Evaluating and debugging prompts

You cannot improve what you do not measure. Cyrus recommended:

  • Using simple categorical rubrics (fail / good / not applicable) rather than graded scores
  • Reviewing data directly rather than relying only on automated evals
  • Iteratively discovering requirements, as discussed in Shreya Shankar’s paper “Who Validates the Validators?”
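A coarse rubric like this keeps grading unambiguous. The sketch below is an illustration of the fail/good/not-applicable scheme, with hypothetical criterion names; real criteria would be graded by human reviewers or an LLM judge.

```python
def grade(output, criteria):
    """Apply each rubric criterion to an output; every check returns
    'good', 'fail', or 'n/a' rather than an ambiguous numeric score."""
    return {name: check(output) for name, check in criteria.items()}

# Hypothetical criteria for a support reply.
criteria = {
    "cites_policy": lambda o: "good" if "policy" in o else "fail",
    "no_refund_promise": lambda o: "fail" if "guaranteed refund" in o else "good",
}

print(grade("Per our policy, returns are accepted within 30 days.", criteria))
```

Coarse labels like these are easy for reviewers to apply consistently, which is the point of avoiding graded scales.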

Future of prompt engineering

Cyrus sees a clear trajectory:

  • AI models will outperform most human prompt engineers
  • Higher-level “prompting languages” will emerge, just as programming evolved from assembly
  • Specialist prompt engineers may remain for edge cases, similar to CUDA kernel developers

Better UX is on the horizon: instead of writing prompts, users will provide conversations or goals, and models like Claude or GPT will generate structured prompts automatically.


Leveraging customer feedback

Customer feedback can serve as live error analysis, but requires careful filtering. Cultural differences mean thumbs up/down signals are not always reliable.

  • Small datasets → manually add to eval sets and tweak prompts
  • Large datasets → feed to an LLM to detect themes and patterns

Either way, feedback loops are critical to keeping prompts effective in real-world use.
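The small-versus-large routing rule above can be expressed as a one-branch function. This is a toy sketch; the threshold and function name are made up for illustration.

```python
def route_feedback(feedback_items, threshold=50):
    """Route a feedback batch: small batches go to manual review and eval-set
    curation, large batches to an LLM pass that detects themes and patterns."""
    if len(feedback_items) <= threshold:
        return "manual_review"
    return "llm_theme_analysis"
```

Either path ends the same way: the findings flow back into the eval sets that gate the next prompt change.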


Key takeaways

  1. Clear, structured prompts are the foundation of reliable AI support.
  2. Systematic error analysis prevents “prompt hell.”
  3. XML tagging and regression test sets are powerful tools.
  4. Feedback, from customers and human reviewers alike, is the engine for continuous improvement.
  5. Prompt engineering is moving quickly, and teams should expect higher-level abstractions and better tooling.

A big thank you to Cyrus for sharing his expertise, and to everyone who joined us live.