How is AgentClash different from a static leaderboard?

Leaderboards summarize one-off scores on generic tasks. AgentClash runs same-task evals on frozen challenge packs with the same tools, sandbox policy, and iteration budget, then publishes replay evidence and scorecards you can reuse as regression gates.

What gets frozen in a public benchmark?

The challenge pack version, runtime constraints, tool policy, scoring spec, and input cases. Every model in an eval run sees the same workload so differences show up in trajectories, not setup drift.

Can we run the same benchmark on our agents?

Yes. Book an eval workshop or start a Team workspace. We pin the same pack on your workloads, baseline your current agent, and deliver scorecards and CI gates your platform team can ship on.

How often do you publish benchmark reports?

We publish when major models ship and on a monthly reliability cadence. Subscribe to the benchmarks RSS feed or check this hub for the latest same-task summary and links to full replays.

AgentClash

Benchmarks

Head-to-head AI agent benchmarks you can reproduce

Public eval runs on frozen challenge packs: same tools, same constraints, full trajectory scoring, and replay evidence. Not a vibes leaderboard. Run the same benchmark on your agents when you are ready to gate releases.

Latest report

We raced four GPT generations on a real coding task — GPT-5.4 won on efficiency

GPT-5.4 solved the task in 2 turns and ~8K tokens; GPT-5 and GPT-5.5 also passed, while GPT-4.1 thrashed for its full budget and failed to submit.

2026-06-07 · Expression Evaluator Arena v1 · GPT-5.4

#	Model	Composite	Correctness	Reliability	Latency	Cost	$/correct
1	GPT-5.4Winner OpenAI	100	100	100	—	100	—
2	GPT-5 OpenAI	100	100	100	—	55	—
3	GPT-5.5 OpenAI	88	100	100	—	46	—
4	GPT-4.1 OpenAI	0	0	0	—	20	—

Read the full report Coding agent benchmark — June 2026

Methodology

Races, not leaderboard snapshots

Static leaderboards hide setup drift. AgentClash benchmarks are same-task eval runs on pinned packs so you can compare models, replay evidence, and promote the same workload to a CI gate.

Frozen challenge pack

Each eval run pins a versioned YAML pack: prompts, tools, sandbox, evaluation spec, and input cases. No moving targets mid-benchmark.

Same runtime constraints

Every candidate gets the same sandbox, tool policy, network rules, and iteration budget so comparisons stay fair.

Baseline vs candidate

Enterprise teams reuse the same packs to compare a ship candidate against a known baseline and fail CI when scorecards regress.

Scorecard dimensions

Composite scores fold in correctness, reliability, latency, cost, behavioral signals, and judge evidence from the full trajectory.

Replay and evidence

Runs preserve tool calls, artifacts, and judge rationale so reviewers can audit a verdict without rerunning the eval.

Gate verdict

The same workload can power a public benchmark today and a release gate tomorrow once your team trusts the scoring rules.

Monthly cadence

How reports stay reproducible

01Pin a challenge pack version, git commit, and runtime constraints in the report appendix.
02Run every candidate on the same pack with identical tools, sandbox, and budgets.
03Export ranking JSON (`agentclash run ranking`) and attach replay plus validator evidence.
04Publish measured MDX on /benchmarks and a shareable monthly blog summary.
05Summarize on this hub, RSS feed, and changelog; promote failures into pack coverage.

Owner checklist and scorecard export format: monthly benchmark runbook.

Go deeper

Benchmark pages for your team

AI agent benchmarkBenchmark agents on workloads your team owns, with replay and scorecards instead of leaderboard-only snapshots.Agent reliability benchmarkTrack pass rates, cost drift, and promoted failures with repeatable packs and CI regression gates.

Challenge packs

Reproduce the workload

Challenge pack referenceAuthor versioned YAML packs with scoring, tools, sandbox policy, and eval workflows.Eval workflows and gatesWire baseline versus candidate comparisons into CI release policy.Example packs in the repoStart from expression evaluators, refund recovery, incident response, and security stress packs.

Blog posts on benchmarking agents

All reports

Every same-task eval we have published

2026-06-07 · GPT-5.4We raced four GPT generations on a real coding task — GPT-5.4 won on efficiencyGPT-5.4 solved the task in 2 turns and ~8K tokens; GPT-5 and GPT-5.5 also passed, while GPT-4.1 thrashed for its full budget and failed to submit.