Evals Are Product Management

January 17, 2026


I used to think evaluation was a research chore. Then I shipped an agent.

In the early days, we measured progress with what I can only describe as vibes. We would run a few prompts, nod, and call it a release. A week later, a user would send a screenshot of the agent confidently doing the wrong thing, and we would scramble to reproduce it. That loop felt familiar and harmless until the product started to grow.

Agents magnify ambiguity. They touch more tools, make more decisions, and run longer. The difference between a good day and a bad day is rarely a single bug. It is usually a small mis-spec that gets amplified across steps. At scale, vibes are not a strategy. They are a liability.

That is when evals became product management for me.

Evals As Product Specs

A good eval is not just a metric. It is a statement of intent. It tells the agent, and the team, what the product is supposed to be reliable at. That is what product specs do.

When I write evals now, I am effectively writing the smallest possible product brief; a short code sketch of that brief follows the list:

  • What user intent matters most?
  • What constraints are non-negotiable?
  • What does a correct outcome look like?
  • What kinds of failures are unacceptable even if the task is completed?
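
Here is what that brief can look like when written down as data rather than prose. This is a minimal sketch; the field names and the example spec are my own illustration, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class EvalSpec:
    """One eval written as the smallest possible product brief.

    The field names are illustrative, not a standard schema.
    """
    user_intent: str                  # the user goal this eval protects
    hard_constraints: list[str]       # non-negotiable rules, checked on every run
    correct_outcome: str              # what a passing answer must contain or achieve
    unacceptable_failures: list[str]  # failures that sink the eval even if the task "completes"


# Example: a support-summary eval expressed as a spec rather than a score.
support_summary_spec = EvalSpec(
    user_intent="Summarize a support thread for a weekly account review",
    hard_constraints=["Never mix data from two customer accounts in one answer"],
    correct_outcome="Names the customer, the issue, and the current status",
    unacceptable_failures=["Leaking another account's details", "Inventing a resolution"],
)
```

Once the brief is data, a run is judged against the spec, not against anyone's memory of what the agent is supposed to do.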

Once I started framing evals this way, everything got easier. It was no longer about scoring models. It was about shipping reliability.

The Day I Stopped Trusting Vibes

One release taught me the lesson clearly. We added a new tool for structured search. The agent became faster and appeared more confident. Our spot checks looked good. Then a user reported that the agent had mixed data from two different customer accounts when summarizing a support thread.

The bug was subtle: a single tool call was returning results from a cached session, and the agent used them without verifying context. The agent was not "wrong" in a classic sense. It was just overconfident.

That incident was not a test gap. It was a spec gap. We had never encoded the rule: never mix data from two accounts in a single answer. Once we added it as an eval, we never missed it again.

That is the PM moment. Evals are where you turn a failure story into a durable requirement.

A GAIA-Inspired Ladder For Eval Difficulty

To scale evals, I borrowed a structure from GAIA. The key idea: not all tasks are equal, so evaluate them in levels.

Level 1: Deterministic, single-step tasks. These are the tasks where the answer is objective and verifiable. For example:

  • Given a list of line items, compute the final invoice total with tax.
  • Convert a short paragraph into a JSON object with fixed keys.
  • Extract the top three deadlines from a project brief.

I started with these because they are cheap to create and easy to verify. They build confidence and reveal basic tool reliability.
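
A Level 1 eval can be as small as a precomputed answer plus an exact comparison. Here is a minimal sketch using the invoice example above; the line items and the 8% tax rate are invented for illustration.

```python
# Level 1: deterministic input, one objectively correct answer.

def expected_invoice_total(line_items: list[dict], tax_rate: float) -> float:
    """Ground truth, computed outside the agent from the same inputs."""
    subtotal = sum(item["quantity"] * item["unit_price"] for item in line_items)
    return round(subtotal * (1 + tax_rate), 2)


def check_level1(agent_answer: str, line_items: list[dict], tax_rate: float) -> bool:
    """Pass only if the agent's number matches the precomputed total."""
    expected = expected_invoice_total(line_items, tax_rate)
    try:
        return abs(float(agent_answer.strip().lstrip("$")) - expected) < 0.01
    except ValueError:
        return False  # unparseable output is a formatting failure, not a near miss


items = [
    {"quantity": 3, "unit_price": 19.99},
    {"quantity": 1, "unit_price": 249.00},
]
print(check_level1("$333.69", items, tax_rate=0.08))  # True
```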

Level 2: Multi-step reasoning and computation. These tasks look more like real agent workflows. An example I used:

  • The input is a messy meeting transcript and a CSV of open tickets.
  • The agent has to identify owners, map action items, and update a task list.
  • The expected output is a clean table with the correct owners and due dates.

Level 2 is where you start to see fragile reasoning and tool integration problems. It is also where regressions hide.
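
The check for a task like that can stay simple even when the workflow is not. Here is a sketch, assuming the correct owner and due-date mapping is already known from the source CSV; the task names and fields are invented for illustration.

```python
# Level 2: multi-step workflow, but the final artifact is still verifiable
# against a mapping we already know from the source data.

EXPECTED_ASSIGNMENTS = {
    "Fix login timeout": {"owner": "priya", "due": "2026-01-23"},
    "Update billing FAQ": {"owner": "marcus", "due": "2026-01-30"},
}


def check_level2(agent_rows: list[dict]) -> list[str]:
    """Return a list of failures; an empty list means the eval passes."""
    failures = []
    by_task = {row["task"]: row for row in agent_rows}
    for task, expected in EXPECTED_ASSIGNMENTS.items():
        row = by_task.get(task)
        if row is None:
            failures.append(f"missing task: {task}")
        elif row["owner"] != expected["owner"] or row["due"] != expected["due"]:
            failures.append(f"wrong owner or due date for: {task}")
    return failures
```

Returning the list of failures, rather than a single boolean, is what later feeds the failure taxonomy.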

Level 3: Fuzzy instructions with ambiguity. These are the tasks that force the agent to interpret intent. For example:

  • "Clean up this support summary and make it suitable for a CEO update."
  • "Draft a short reply to this partner and keep the tone warm, but direct."
  • "Summarize this thread and highlight only the blockers."

Level 3 is where most product teams want to use LLM-as-a-judge. But it is also where it is easiest to fool yourself.

This ladder did two things for me: it kept eval creation disciplined, and it made it obvious which failures were acceptable for now and which were not.


Ground Truth Is The Price Of Fairness

My biggest takeaway from designing eval tasks is this: you cannot be fair without ground truth.

For Level 1 and Level 2, ground truth is straightforward. You can precompute the correct answer or validate against a known system of record. For Level 3, it gets trickier. The answer may not be a single string. But you can still define fairness with explicit invariants:

  • Required facts that must appear.
  • Facts that must not appear.
  • A formatting constraint that keeps the output usable.
  • A tone constraint that can be scored with a rubric.

In one eval, I asked the agent to summarize a messy internal thread for a weekly update. The ground truth was not a perfect sentence. It was a checklist of four required points and two forbidden leaks. If all six constraints passed, the answer was correct. If any failed, it was not.

That structure mattered more than any judge model. The ground truth made the judgment explicit and repeatable.
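
In code, that checklist can be as plain as a list of required facts and a list of forbidden strings. This is a minimal sketch; the facts and leak strings are invented for illustration, and simple substring matching stands in for whatever matching the real eval needs.

```python
# Constraint-style ground truth for a fuzzy summary task:
# four required points, two forbidden leaks, six pass/fail checks.

REQUIRED_FACTS = [
    "migration is blocked on the vendor api",
    "two customers are affected",
    "workaround shipped on tuesday",
    "full fix targeted for next sprint",
]
FORBIDDEN_LEAKS = [
    "contract value",
    "incident channel link",
]


def check_summary(summary: str) -> bool:
    """Pass only if every required fact appears and no forbidden leak does."""
    text = summary.lower()
    required_ok = all(fact in text for fact in REQUIRED_FACTS)
    no_leaks = not any(leak in text for leak in FORBIDDEN_LEAKS)
    return required_ok and no_leaks
```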

Failure Taxonomies Turn Chaos Into Roadmaps

When you run evals at scale, you stop seeing failures as random. You see them as patterns. That is where the failure taxonomy earns its keep.

Here is the taxonomy I keep returning to:

  • Instruction following: missed a key constraint or ignored a requirement.
  • Tool misuse: called the wrong tool, used stale context, or skipped verification.
  • Reasoning errors: the reasoning chain breaks midway or assumes a false dependency.
  • Retrieval misses: failed to fetch the relevant document or used a weak query.
  • Formatting failures: the output may be correct but unusable in the product.
  • Safety or privacy slips: leaked information across contexts.

Once you label failures this way, roadmapping becomes simpler. You can decide whether to invest in tool reliability, prompt clarity, or model routing. The taxonomy turns a pile of bugs into a plan.
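
To keep that labeling honest, each failed run gets exactly one label, and the counts become the roadmap input. A sketch, with the label set mirroring the list above:

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    INSTRUCTION_FOLLOWING = "instruction_following"
    TOOL_MISUSE = "tool_misuse"
    REASONING_ERROR = "reasoning_error"
    RETRIEVAL_MISS = "retrieval_miss"
    FORMATTING = "formatting"
    SAFETY_PRIVACY = "safety_privacy"


def summarize_failures(labeled_failures: list[FailureType]) -> list[tuple[str, int]]:
    """Turn a pile of labeled failures into a ranked list for planning."""
    return Counter(f.value for f in labeled_failures).most_common()


# Example: three labeled failures from one eval run.
print(summarize_failures([
    FailureType.TOOL_MISUSE,
    FailureType.TOOL_MISUSE,
    FailureType.SAFETY_PRIVACY,
]))  # [('tool_misuse', 2), ('safety_privacy', 1)]
```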

Golden Tasks And Regression Checks

The best thing I added to our process was a small set of golden tasks. Think of them as unit tests for the product experience.

A golden set should be small, stable, and representative of the moments you never want to break. For us it included:

  • A multi-step customer support summary with strict privacy boundaries.
  • A spreadsheet update task that requires correct numeric computation.
  • A fuzzy rewrite that must preserve a specific tone and fact set.

We ran these in CI. If the golden tasks failed, the release paused. It was not glamorous, but it reduced panic. Regression checks did more for shipping confidence than any leaderboard.
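
The gate itself does not need much infrastructure. Here is a sketch, where `run_agent` and `check` are placeholders for the product's real entry points and the fixture paths are invented.

```python
# A golden set as a CI gate: any golden failure pauses the release.

GOLDEN_TASKS = [
    {"name": "support_summary_privacy", "input": "fixtures/support_thread.json"},
    {"name": "spreadsheet_totals", "input": "fixtures/invoice_lines.csv"},
    {"name": "ceo_update_rewrite", "input": "fixtures/raw_update.txt"},
]


def run_golden_set(run_agent, check) -> int:
    """Run every golden task; return the number of failures."""
    failures = 0
    for task in GOLDEN_TASKS:
        output = run_agent(task["input"])
        if not check(task["name"], output):
            print(f"GOLDEN FAILURE: {task['name']}")
            failures += 1
    return failures

# In CI, a non-zero failure count maps to a non-zero exit code,
# which is what actually pauses the release.
```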

Why I Am Skeptical Of LLM-As-A-Judge

I still use LLM-as-a-judge, but I do not trust it as the primary arbiter.

There are two reasons, and they are both practical.

First, standard tasks with ground truth already give you a lot. Most teams under-invest in clean tasks and over-invest in complex judges. If you can define the right tasks, you can measure progress without subjective scoring.

Second, the judge paradox is real. To trust the judge, you need a judge that is clearly stronger than the agent on the task. Otherwise, you are just asking a similar model to grade itself. That is not evaluation. That is theater.

The nuance is that a judge does not have to be stronger on the task itself. It has to be consistent on the rubric. But consistency is hard to prove. So I treat the judge as a helper, not a referee.

Where LLM Judges Still Help

There are situations where LLM judges are useful, especially in Level 3 tasks.

I use them in three ways:

  • Rubric scoring: the judge scores a response against a checklist of required facts.
  • Equivalence detection: the judge flags answers that are semantically correct but phrased differently.
  • Triage: the judge highlights borderline cases for human review.

Even then, I prefer to separate model families. If the agent and judge are too similar, the judge will validate the agent's mistakes. A weak, biased judge is worse than no judge.
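
Here is the shape I use for rubric scoring when I do reach for a judge. The rubric questions are invented for illustration, and `call_judge` is a placeholder for a client from a different model family than the agent.

```python
# Rubric scoring with the judge kept in a helper role: the rubric defines
# correctness, the judge only answers YES or NO per item, and anything
# short of a clean pass goes to human triage.

RUBRIC = [
    "Does the reply mention the revised delivery date?",
    "Is the tone warm but direct?",
    "Does it avoid quoting internal pricing discussions?",
]

JUDGE_PROMPT = (
    "Answer YES or NO only.\n"
    "Question: {question}\n"
    "Response to evaluate:\n{response}\n"
)


def rubric_score(response: str, call_judge) -> dict:
    """Score one response item by item; flag it for human review unless all pass."""
    results = {}
    for question in RUBRIC:
        verdict = call_judge(JUDGE_PROMPT.format(question=question, response=response))
        results[question] = verdict.strip().upper().startswith("YES")
    results["needs_human_review"] = not all(results[q] for q in RUBRIC)
    return results
```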

The Practical Playbook I Use Now

If I have to summarize my eval workflow as PM practice, it looks like this:

  • Start with a failure taxonomy and keep it updated.
  • Build a GAIA-style ladder so you do not only test the easy cases.
  • Write ground truth as constraints, not just strings.
  • Curate a golden set and run it on every release.
  • Track regressions, not just aggregate scores.

This feels less like model evaluation and more like product assurance. That is the point.


Closing Thought

Agents make it easy to ship quickly and dangerously. Evals are how I slow the right parts down. If I had to describe product management for agents in one sentence, it would be this: turn failures into tests, and tests into truth.