I keep returning to a small idea from Claude's pass@k discussion in Demystifying evals for AI agents: the first success is disproportionate. A single clean run can change a product's narrative, even if the underlying system is still fragile. That is why I think early agent products should optimize pass@k, not pass^k. Later, when the product becomes part of a real workflow, the order flips.
This is not just about evaluation math. It is about product psychology.
The Moment I Started Thinking In pass@k
I remember watching early demos of agents like Manus. The success rate was not great, and the failures were very real. But when it worked even once, it felt like the future had already arrived. A single run could show: the agent can open a browser, navigate a real site, complete a task, and report back. That single run was more convincing than ten careful disclaimers.
That is the pass@k moment: the product earns belief the first time it lands. People do not average their emotional response across ten tries. They lock onto the best try and imagine it as the baseline. That is how early agents get their first wave of attention.
What pass@k Actually Measures
I like to translate these metrics into plain language. pass@k is the probability that the agent succeeds at least once in k attempts. pass^k (as it is often framed in the same discussion) is the probability that it succeeds every time across k attempts. It is the difference between "can it pull this off once?" and "can I rely on it every time?"
Those two questions map to two different product stages:
- pass@k maps to discovery and belief.
- pass^k maps to trust and efficiency.
Both are necessary. But the timing matters. In the early stage, you need belief before you can earn trust.
I like to sanity-check the difference with a toy example. If pass@1 is 30 percent, then pass@3 is roughly 66 percent (1 minus 0.7 cubed, or 0.657), because you only need one success. That feels surprisingly good in a demo. But pass^3 is about 2.7 percent (0.3 cubed), because the user needs three wins in a row. Same system, completely different lived experience. This is why teams can argue past each other if they do not name which metric they are optimizing.
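The toy example can be written as two one-line functions, under the simplifying assumption that attempts are independent with a fixed per-attempt success rate p (real agents rarely behave this cleanly, but the shape of the gap is the point):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability of succeeding on every one of k independent attempts."""
    return p ** k

# The 30-percent agent from the toy example:
print(pass_at_k(0.3, 3))   # ~0.657 — looks great in a demo
print(pass_hat_k(0.3, 3))  # ~0.027 — unusable as a daily workflow
```

Notice that pass@k rises with k while pass^k collapses, which is exactly why adding retries flatters the demo metric and does nothing for the trust metric.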
Why pass@k Creates Viral Gravity
The internet rewards the exceptional case. A single successful run becomes a clip, a screenshot, a post. That post spreads because it is surprising, not because it is representative. Viral distribution has the same shape as pass@k: "at least one incredible outcome is enough to trigger a share."
That is why products like Clawdbot can go viral even if the average run is noisy. If one user manages a delightful WhatsApp conversation with an agent, it is enough to create a story that travels. The success feels like evidence, and the failure is discounted as "early days."
This is a feature of the market, not a bug. If you are shipping a new agent, chasing pass@k is not just an evaluation decision. It is a growth strategy.
The Cost Of Living On One Win
There is a shadow side to pass@k. The same variance that creates wow moments also creates frustration the second time. The user's expectation is anchored to the best run, and every failure after that feels like a regression.
This is why early agent products feel magical and then brittle. The first demo is a peak. The second attempt is a reminder that the system is probabilistic. Users will forgive that early on. They will not forgive it once the agent is integrated into real work.
I have seen this play out in pilots. A leadership team sees the highlight reel and signs off on a trial. The frontline team runs the agent ten times in a row and builds a failure log. The delta between those two experiences is what kills momentum. That gap is not a marketing problem. It is a pass^k problem.
In other words, pass@k is a marketing asset but a product liability. It is the metric of the demo, not the metric of the day-to-day workflow.
pass^k Is The Metric Of Trust
When I think about products like Claude Code or Codex, I do not want a single miraculous run. I want the boring guarantee: every time I run the same task, the agent should complete it, quickly, with minimal supervision. That is pass^k.
pass^k turns the system into infrastructure. It does not just measure accuracy; it measures the stability of the user's time. If the agent's success rate fluctuates, the user ends up doing manual checkpoints, re-tries, and verification. That overhead kills the productivity promise.
So pass^k is not only about correctness. It is about confidence and flow. The user stays in motion because the agent stays reliable.
The Maturity Crossover
There is a natural pivot point when a product should stop optimizing pass@k and start optimizing pass^k. I think of it as the moment when your users switch from "wow" to "why did it fail?"
In my head, the signals look like this:
- Support tickets shift from curiosity to frustration.
- Users repeat tasks instead of experimenting with new ones.
- Teams start building workflows on top of the agent instead of using it as a novelty.
- Sales conversations become about ROI and reliability rather than magic.
At that point, pass@k becomes a distraction. The product needs to be predictable, not surprising.
A Practical Shift Plan
The shift from pass@k to pass^k is not just philosophical. It requires different engineering habits.
Here is the practical version I keep using:
- Reduce variance in tool usage by enforcing stricter schemas and retries.
- Make the agent state explicit so repeated runs are comparable.
- Log and regress on the most common failure classes instead of the most impressive wins.
- Optimize for time-to-correct-output, not just correctness.
- Add guardrails that constrain creativity when the task is procedural.
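The first habit on that list can be sketched in a few lines. This is a hypothetical wrapper, not any particular framework's API: the schema, the `validate` check, and `call_with_retries` are all invented for illustration. The idea is simply that a tool output which fails a strict schema check is retried, never accepted as "close enough":

```python
# Assumed schema for illustration: every tool output must carry these fields.
REQUIRED_FIELDS = {"action": str, "target": str}

def validate(payload: dict) -> bool:
    """Reject any output missing a required field or using a wrong type."""
    return all(isinstance(payload.get(k), t) for k, t in REQUIRED_FIELDS.items())

def call_with_retries(tool_call, max_retries: int = 3) -> dict:
    """Run a tool call, retrying until its output passes schema validation."""
    for _ in range(max_retries):
        result = tool_call()
        if validate(result):
            return result
    raise RuntimeError(f"schema validation failed after {max_retries} attempts")
```

The same pattern generalizes: the stricter the contract at each tool boundary, the smaller the variance a single sloppy output can inject into the rest of the run.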
I also change the way I read eval results. If the average score goes up but variance stays high, I treat that as a red flag. pass^k is allergic to variance. It is better to be slightly less capable and far more consistent than to ship a spiky system that cannot be trusted in production.
This is the boring work. It is also the work that turns a demo into a product.
How I Explain This To Teams
When I am asked which metric to chase, I try to make it concrete. If you are building a brand-new agent, you are fighting for attention. You need the market to see one clean proof of value. That is pass@k. It is the metric of adoption.
If you are building an agent that people will use every week, the goal changes. You want boring reliability. That is pass^k. It is the metric of retention.
The good teams I have seen do both, but in sequence. They chase the wow moment first, then invest in the unglamorous reliability work that keeps the wow from fading.
Closing Thought
The early game is about earning belief; the late game is about earning trust. Optimize pass@k to get a story, then optimize pass^k to keep the customer.