2026-06-09 · Atharva

pass@k, pass^k, and Reliability: What Enterprise Teams Should Measure

We covered the definitions in "pass@k vs pass^k". This post extends that primer for enterprise teams deciding what to gate in CI.

The short version:

pass@k — at least one success in k independent tries
pass^k — success on all k tries

Similar notation, different release contracts.
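
To make the gap between the two concrete, here is a minimal sketch that estimates both from the same set of attempts. It assumes n independent, identically configured attempts on one task, of which c succeeded, and uses the standard combinatorial estimators common in code-generation and agent benchmarks. The function names pass_at_k and pass_hat_k are illustrative, not part of any AgentClash API.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n attempts, c successes, and a k no larger than the number of attempts.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k tries) from n attempts, c successes."""
    _check(n, c, k)
    # 1 minus the chance that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k tries succeed) from n attempts, c successes."""
    _check(n, c, k)
    # Chance that a random size-k subset contains only successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c, k = 10, 7, 3
    print(f"pass@{k} = {pass_at_k(n, c, k):.4f}")
    print(f"pass^{k} = {pass_hat_k(n, c, k):.4f}")
```

With 7 successes in 10 attempts and k = 3, the estimates come out near 0.99 for pass@3 and 0.29 for pass^3: the same agent looks close to perfect under one contract and unreliable under the other.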

What enterprise buyers actually ask

Release committees rarely ask for a leaderboard score. They ask:

1. Can we ship this agent without a human in the loop for this workload?
2. What happens on the second and third attempt?
3. If it fails, can we explain why with evidence?

pass@k and pass^k answer questions 2 and 3 when paired with replay. Question 1 still needs policy, cost ceilings, and artifact checks on the scorecard.

Mapping metrics to product promises

Product promise	Primary metric	Gate posture
"Usually works if the user retries"	pass@k with bounded k	Warn on pass^k regression
"Must not fail twice in a row"	pass^k	Block release on any strict failure
"Expensive failures are unacceptable"	pass^k + cost-per-success	Block on cost regression

Document the promise in the challenge pack README so evaluators do not argue about the metric after the run.
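
One way to keep that agreement from drifting is to encode the table as a small gate function that CI can call. The sketch below is an assumption-heavy illustration: the promise keys, the Scorecard fields, and the reading of "strict failure" as pass^k below 1.0 are all hypothetical choices, and the table does not say how to treat cases it leaves unnamed (for example, a pass@k regression under the retry promise), so those fall through to PASS here.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class Scorecard:
    # Hypothetical fields; a real scorecard will likely carry more dimensions.
    pass_at_k: float
    pass_hat_k: float
    cost_per_success: float


def gate(promise: str, baseline: Scorecard, candidate: Scorecard) -> Verdict:
    """Map a documented product promise to a gate verdict (illustrative only)."""
    if promise == "retry_ok":
        # "Usually works if the user retries": warn on pass^k regression.
        if candidate.pass_hat_k < baseline.pass_hat_k:
            return Verdict.WARN
        return Verdict.PASS
    if promise == "no_repeat_failure":
        # "Must not fail twice in a row": assumed to mean every case
        # must pass all k attempts, so anything below 1.0 blocks.
        if candidate.pass_hat_k < 1.0:
            return Verdict.BLOCK
        return Verdict.PASS
    if promise == "costly_failure":
        # "Expensive failures are unacceptable": block on cost regression.
        if candidate.cost_per_success > baseline.cost_per_success:
            return Verdict.BLOCK
        return Verdict.PASS
    raise ValueError(f"unknown promise: {promise!r}")


if __name__ == "__main__":
    base = Scorecard(pass_at_k=0.95, pass_hat_k=0.80, cost_per_success=1.00)
    cand = Scorecard(pass_at_k=0.95, pass_hat_k=0.70, cost_per_success=1.00)
    print(gate("retry_ok", base, cand))
```

Raising on an unknown promise is deliberate in this sketch: a pack with no documented promise should fail loudly rather than silently pass the gate.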

How to run both without gaming

Freeze the pack so every attempt sees the same tools and fixtures.
Run k independent sandboxes per candidate (no shared warm state).
Inspect disagreement in replay when pass@k is high but pass^k is low.
Promote clustered failures into new cases before widening k.
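
The steps above can be sketched as a small harness loop. This is a rough illustration, not how AgentClash runs attempts: run_once is a hypothetical callback that is assumed to start a fresh sandbox for each (case, attempt) pair and return True on success. Here it is faked with fixed outcomes so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List


def run_attempts(
    case_ids: Iterable[str],
    run_once: Callable[[str, int], bool],
    k: int,
) -> Dict[str, List[bool]]:
    """Run k attempts per case; run_once must not share state across calls."""
    return {cid: [run_once(cid, i) for i in range(k)] for cid in case_ids}


def summarize(results: Dict[str, List[bool]]) -> dict:
    """Per-pack pass@k and pass^k, plus the cases where attempts disagree."""
    if not results:
        raise ValueError("no cases to summarize")
    n = len(results)
    any_pass = sum(any(r) for r in results.values())
    all_pass = sum(all(r) for r in results.values())
    # Cases with mixed outcomes are the ones to inspect in replay.
    flaky = sorted(cid for cid, r in results.items() if any(r) and not all(r))
    return {
        "pass_at_k": any_pass / n,
        "pass_hat_k": all_pass / n,
        "flaky_cases": flaky,
    }


if __name__ == "__main__":
    # Fixed fake outcomes standing in for real sandboxed runs.
    outcomes = {
        "case-a": [True, True, True],
        "case-b": [True, False, True],
        "case-c": [False, False, False],
    }
    results = run_attempts(outcomes, lambda cid, i: outcomes[cid][i], k=3)
    print(summarize(results))
```

The flaky_cases list is the useful output: those are the cases that inflate pass@k without earning pass^k, and the natural candidates to promote into new pack cases.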

AgentClash keeps pack version, replay, and scorecards together so you can audit whether a "pass" was lucky or repeatable. See agent reliability benchmark.

When to tighten from pass@k to pass^k

Teams often start with pass@k while discovering the workload, then tighten once:

tool strategy stabilizes
validators cover the known failure modes
cost per success is predictable

That transition is a maturity signal, not a moral judgment. Early exploration should not be blocked by pass^k before you understand the task.

Enterprise checklist before you gate

Workload encoded as a versioned challenge pack
Baseline agent frozen with explicit model and harness IDs
k chosen to match real retry policy (not "whatever fits the slide")
Replay retention policy agreed with security
Scorecard dimensions match the release committee questions
CI gate documented in the PR template

Need a baseline and gate wired in two weeks? See Benchmark & Gate Setup or start the enterprise rollout.
