2026-06-06 · Claude Opus 4.8 · Real-World Agentic Tasks v1

We raced Claude Opus 4.8 against the field on 5 real agentic tasks

Opus 4.8 took 4 of 5 tasks and the overall composite; the field stayed close on cost, and Gemini 3 Pro won the latency-sensitive incident RCA.

Sample data. This report uses representative numbers to illustrate the format. Run the race to publish a report with real results.

Scoreboard

#	Model	Vendor	Composite	Correctness	Reliability	Latency	Cost	$/correct
1	Claude Opus 4.8 (winner)	Anthropic	91	95	93	78	74	$0.14
2	GPT-5.1	OpenAI	86	90	88	81	71	$0.17
3	Gemini 3 Pro	Google	84	86	83	92	76	$0.15
4	Grok 4	xAI	79	82	77	80	69	$0.21
5	Mistral Large 3	Mistral	71	74	70	75	83	$0.11

The tasks

01 Fix the auth bug
Diagnose and patch a broken JWT refresh flow that silently logs users out, then prove the fix with a passing test.
02 Hunt the p99 latency regression
Given a service and its traces, find the endpoint whose p99 regressed and identify the offending change.
03 Triage a flaky test
Isolate a non-deterministic test, find the race condition behind it, and make it deterministic without disabling it.
04 Write a safe schema migration
Add a non-null column to a hot table with a backfill and a reversible migration that won't lock writes.
05 Run a log-driven incident RCA
From raw logs of a partial outage, reconstruct the timeline and name the root cause with evidence.

This is a sample report. The numbers above are representative, not measured. It exists to demonstrate the format. See the workflow runbook for how to produce a real one from an AgentClash race.

How the race worked

Every model got the same five tasks, the same sandbox, and the same tools. No task was a trivia question — each one is a slice of real engineering work with a verifiable outcome: a test that has to pass, a regression that has to be named, a migration that has to be reversible. AgentClash scores the whole trajectory, not just the final answer, across four vantages — deterministic checks, numeric metrics, behavioral signals, and LLM judges — and folds them into one composite verdict. (More on that in how AgentClash scores agent trajectories.)
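To make "folds them into one composite" concrete, here is a minimal sketch of one way to combine four vantage scores: a weighted mean. The weights, the vantage keys, and the example scores are all hypothetical. The report does not say how AgentClash actually weights or combines its vantages, so read this as an illustration of the idea, not the scoring formula.

```python
# Hypothetical weights for illustration only; the report does not state
# how AgentClash actually weights its four vantages.
VANTAGE_WEIGHTS = {
    "deterministic_checks": 40,
    "numeric_metrics": 30,
    "behavioral_signals": 15,
    "llm_judges": 15,
}


def composite(scores, weights=VANTAGE_WEIGHTS):
    """Return the weighted mean of per-vantage scores (each on a 0-100 scale)."""
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing vantage scores: {missing}")
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return weighted_sum / sum(weights.values())


# Made-up vantage scores for a single trajectory.
example_scores = {
    "deterministic_checks": 100,
    "numeric_metrics": 80,
    "behavioral_signals": 90,
    "llm_judges": 70,
}

print(composite(example_scores))  # 88.0
```

With a weighted mean, a trajectory that passes every deterministic check can still lose points on behavior or judge scores, which is the sense in which the whole trajectory is scored rather than only the final answer.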

What we saw

Opus 4.8 won on correctness and reliability. It was the only model to land the schema migration with a reversible down-migration on the first try, and it recovered cleanly when the flaky-test sandbox surfaced an unexpected race.

The field stayed close on cost. Composite scores spanned 20 points (71 to 91), while cost per correct answer stayed within a ten-cent band ($0.11 to $0.21): the cheaper models simply needed more attempts to get there.
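The arithmetic behind that observation can be sketched as follows, assuming cost per correct answer is total spend divided by the number of correct outcomes. The per-attempt prices and attempt counts below are invented for illustration and are not taken from the scoreboard.

```python
def cost_per_correct(cost_per_attempt, attempts, correct):
    """Total spend divided by the number of correct outcomes."""
    if correct <= 0:
        raise ValueError("need at least one correct outcome")
    return cost_per_attempt * attempts / correct


# Hypothetical figures: a cheap model that needs many retries versus a
# pricier model that usually succeeds early. Both end with 5 correct tasks.
cheap = cost_per_correct(0.04, attempts=15, correct=5)
pricey = cost_per_correct(0.10, attempts=7, correct=5)

print(f"cheap: ${cheap:.2f}, pricey: ${pricey:.2f}")  # cheap: $0.12, pricey: $0.14
```

In this made-up case a 2.5x gap in per-attempt price shrinks to about 1.2x once retries are counted, which is the kind of compression the paragraph above describes.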

Latency had a different winner. On the log-driven incident RCA, where wall-clock time matters most, Gemini 3 Pro reconstructed the timeline fastest without sacrificing the root-cause evidence.

Takeaway

A single benchmark number hides the trade-off that actually decides which model you ship. Race them on your tasks, with your tools, and read the whole scoreboard — correctness, reliability, latency, and cost — before you pick.
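As a small illustration of reading the whole scoreboard, the sketch below ranks the sample scoreboard's per-column scores under a custom weighting. The scores are copied from the sample table above, so they are representative, not measured. The sketch assumes higher is better in every column, and the latency-heavy weighting is a hypothetical priority, not a recommendation.

```python
# Per-column scores copied from the sample scoreboard (representative data).
SCOREBOARD = {
    "Claude Opus 4.8": {"correctness": 95, "reliability": 93, "latency": 78, "cost": 74},
    "GPT-5.1": {"correctness": 90, "reliability": 88, "latency": 81, "cost": 71},
    "Gemini 3 Pro": {"correctness": 86, "reliability": 83, "latency": 92, "cost": 76},
    "Grok 4": {"correctness": 82, "reliability": 77, "latency": 80, "cost": 69},
    "Mistral Large 3": {"correctness": 74, "reliability": 70, "latency": 75, "cost": 83},
}


def leader(column):
    """Return the model with the highest score in one column."""
    return max(SCOREBOARD, key=lambda model: SCOREBOARD[model][column])


def rank(weights):
    """Rank models by a weighted mean of the chosen columns, best first."""
    total = sum(weights.values())

    def score(model):
        return sum(SCOREBOARD[model][col] * w for col, w in weights.items()) / total

    return sorted(SCOREBOARD, key=score, reverse=True)


for column in ("correctness", "reliability", "latency", "cost"):
    print(f"{column}: {leader(column)}")

# Hypothetical priorities for a latency-sensitive workload.
print(rank({"correctness": 1, "latency": 3}))
```

On the sample data, three different models lead at least one column, and the latency-heavy weighting puts Gemini 3 Pro first even though Opus 4.8 holds the top composite.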
