the client wasn't happy. the agent
learned why.

An evaluation architecture that taught a production AI agent to measure its own blind spots — turning declining satisfaction into a 32% improvement without a single line of new feature code.

built to fix what traditional QA couldn't measure

instinct

Satisfaction was declining. The data didn't show it yet. The conversations did. During staging reviews, the client's engagement shifted — shorter responses, repeated corrections, growing friction. No metric flagged it. But the pattern was clear to anyone paying attention. The AI agent was technically functional. It just wasn't listening well enough.

architecture

Two AI agents. One judges. One optimizes. The prompt improves itself. A golden dataset of 200+ human-evaluated conversations became the benchmark. A judge agent scores each conversation against it. When scores drop, an optimizer agent rewrites the prompt — with a memory layer to prevent regressions. The loop runs until the judge's confidence holds at 80%. Not higher — overtrained prompts break on real users.

impact

Client satisfaction rose 32%. No new features. Just a better listener. The improved prompt handled ambiguity, asked for clarification, and adapted to how clients actually speak — not how the spec said they would. The eval pipeline ran daily as new features shipped, keeping the prompt aligned with evolving behavior. The client called it a transformation. The architecture called it a loop.

1000+client conversations evaluated

32%satisfaction improvement

5team members

3moto MVP

~80%judge confidence target

the client wasn't happy. the agentlearned why.

the client wasn't happy. the agent
learned why.