2026-06-06 · Claude Opus 4.8 · Real-World Agentic Tasks v1
We raced Claude Opus 4.8 against the field on 5 real agentic tasks
Opus 4.8 took 4 of 5 tasks and the overall composite; the field stayed close on cost, and Gemini 3 Pro won the latency-sensitive log-driven incident RCA.
Scoreboard
| # | Model | Composite | Correctness | Reliability | Latency | Cost | $/correct |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.8 (Anthropic), winner | 91 | 95 | 93 | 78 | 74 | $0.14 |
| 2 | GPT-5.1 (OpenAI) | 86 | 90 | 88 | 81 | 71 | $0.17 |
| 3 | Gemini 3 Pro (Google) | 84 | 86 | 83 | 92 | 76 | $0.15 |
| 4 | Grok 4 (xAI) | 79 | 82 | 77 | 80 | 69 | $0.21 |
| 5 | Mistral Large 3 (Mistral) | 71 | 74 | 70 | 75 | 83 | $0.11 |
The tasks
- 01 Fix the auth bug
  Diagnose and patch a broken JWT refresh flow that silently logs users out, then prove the fix with a passing test.
- 02 Hunt the p99 latency regression
  Given a service and its traces, find the endpoint whose p99 regressed and identify the offending change.
- 03 Triage a flaky test
  Isolate a non-deterministic test, find the race condition behind it, and make it deterministic without disabling it.
- 04 Write a safe schema migration
  Add a non-null column to a hot table with a backfill and a reversible migration that won't lock writes.
- 05 Run a log-driven incident RCA
  From raw logs of a partial outage, reconstruct the timeline and name the root cause with evidence.
This is a sample report. The numbers above are representative, not measured. It exists to demonstrate the format. See the workflow runbook for how to produce a real one from an AgentClash race.
How the race worked
Every model got the same five tasks, the same sandbox, and the same tools. No task was a trivia question — each one is a slice of real engineering work with a verifiable outcome: a test that has to pass, a regression that has to be named, a migration that has to be reversible. AgentClash scores the whole trajectory, not just the final answer, across four vantages — deterministic checks, numeric metrics, behavioral signals, and LLM judges — and folds them into one composite verdict. (More on that in how AgentClash scores agent trajectories.)
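To make "folds them into one composite" concrete, here is a minimal sketch of one way four vantage scores could be combined: a weighted mean. This is an assumption for illustration only. The report does not publish AgentClash's formula, and the weights, the 0-100 scale, and the names `VANTAGE_WEIGHTS` and `composite` are all hypothetical.

```python
# Hypothetical weights: the report does not state how AgentClash
# actually weighs its four vantages. Integers keep the arithmetic exact.
VANTAGE_WEIGHTS = {
    "deterministic_checks": 2,
    "numeric_metrics": 1,
    "behavioral_signals": 1,
    "llm_judges": 1,
}


def composite(scores, weights=VANTAGE_WEIGHTS):
    """Return the weighted mean of per-vantage scores (assumed 0-100 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing vantage scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Made-up vantage scores for a single trajectory.
example = {
    "deterministic_checks": 100,
    "numeric_metrics": 80,
    "behavioral_signals": 90,
    "llm_judges": 70,
}
print(composite(example))  # 88.0
```

Whatever the real weighting is, the point carries over: a composite is only as informative as the weights behind it, which is why the per-dimension columns are reported alongside it.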
What we saw
Opus 4.8 won on correctness and reliability. It was the only model to land the schema migration with a reversible down-migration on the first try, and it recovered cleanly when the flaky-test sandbox surfaced an unexpected race.
The field stayed close on cost. Composite scores spanned 20 points (91 to 71), but cost per correct answer stayed within a $0.10 band ($0.11 to $0.21): the cheaper models simply needed more attempts to reach a correct answer.
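The arithmetic behind that effect can be sketched as follows, assuming cost per correct answer is total spend divided by the number of correct answers. The function name and the per-attempt prices, attempt counts, and success counts are invented for illustration; they are not figures from this race.

```python
def cost_per_correct(cost_per_attempt, attempts, correct):
    """Total spend divided by the number of correct answers (assumed definition)."""
    if correct <= 0:
        raise ValueError("no correct answers, so cost per correct is undefined")
    return cost_per_attempt * attempts / correct


# Hypothetical models: one cheap per attempt but less reliable,
# one pricier per attempt but correct every time.
cheap = cost_per_correct(cost_per_attempt=0.05, attempts=10, correct=4)
pricey = cost_per_correct(cost_per_attempt=0.12, attempts=5, correct=5)
print(f"cheap: ${cheap:.3f}  pricey: ${pricey:.3f}")
```

Under these made-up inputs the model that costs less than half as much per attempt ends up slightly more expensive per correct answer, because it needs more attempts to get there.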
Latency had a different winner. On the log-driven incident RCA, where wall-clock time matters most, Gemini 3 Pro reconstructed the timeline fastest without sacrificing the root-cause evidence.
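The per-dimension winners described above can be read straight off the scoreboard. The sketch below copies the table's values and picks the best model per column, assuming that higher is better for every score column and lower is better for $/correct. The names `SCOREBOARD`, `LOWER_IS_BETTER`, and `column_winners` are illustrative.

```python
# Values copied from the scoreboard table (representative, not measured).
SCOREBOARD = {
    "Claude Opus 4.8": {"composite": 91, "correctness": 95, "reliability": 93,
                        "latency": 78, "cost": 74, "usd_per_correct": 0.14},
    "GPT-5.1": {"composite": 86, "correctness": 90, "reliability": 88,
                "latency": 81, "cost": 71, "usd_per_correct": 0.17},
    "Gemini 3 Pro": {"composite": 84, "correctness": 86, "reliability": 83,
                     "latency": 92, "cost": 76, "usd_per_correct": 0.15},
    "Grok 4": {"composite": 79, "correctness": 82, "reliability": 77,
               "latency": 80, "cost": 69, "usd_per_correct": 0.21},
    "Mistral Large 3": {"composite": 71, "correctness": 74, "reliability": 70,
                        "latency": 75, "cost": 83, "usd_per_correct": 0.11},
}

# Assumption: every column is a higher-is-better score except $/correct.
LOWER_IS_BETTER = {"usd_per_correct"}


def column_winners(board):
    """Map each scoreboard column to the model with the best value in it."""
    columns = next(iter(board.values())).keys()
    winners = {}
    for col in columns:
        pick = min if col in LOWER_IS_BETTER else max
        winners[col] = pick(board, key=lambda model: board[model][col])
    return winners


for column, model in column_winners(SCOREBOARD).items():
    print(f"{column:16} {model}")
```

On this table the columns split three ways: Opus 4.8 leads composite, correctness, and reliability, Gemini 3 Pro leads latency, and Mistral Large 3 leads both cost columns.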
Takeaway
A single benchmark number hides the trade-off that actually decides which model you ship. Race them on your tasks, with your tools, and read the whole scoreboard — correctness, reliability, latency, and cost — before you pick.