
2026-06-06 · Claude Opus 4.8 · Real-World Agentic Tasks v1

We raced Claude Opus 4.8 against the field on 5 real agentic tasks

Opus 4.8 took 4 of 5 tasks and the overall composite; the field stayed close on cost, and Gemini 3 Pro won the latency-sensitive incident RCA.

Sample data. This report uses representative numbers to illustrate the format. Run the race to publish a report with real results.

Scoreboard

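| # | Model | Vendor | Composite | Correctness | Reliability | Latency | Cost | $/correct |
|---|-------|--------|-----------|-------------|-------------|---------|------|-----------|
| 1 | Claude Opus 4.8 (Winner) | Anthropic | 91 | 95 | 93 | 78 | 74 | $0.14 |
| 2 | GPT-5.1 | OpenAI | 86 | 90 | 88 | 81 | 71 | $0.17 |
| 3 | Gemini 3 Pro | Google | 84 | 86 | 83 | 92 | 76 | $0.15 |
| 4 | Grok 4 | xAI | 79 | 82 | 77 | 80 | 69 | $0.21 |
| 5 | Mistral Large 3 | Mistral | 71 | 74 | 70 | 75 | 83 | $0.11 |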

The tasks

  1. Fix the auth bug

    Diagnose and patch a broken JWT refresh flow that silently logs users out, then prove the fix with a passing test.

  2. Hunt the p99 latency regression

    Given a service and its traces, find the endpoint whose p99 regressed and identify the offending change.

  3. Triage a flaky test

    Isolate a non-deterministic test, find the race condition behind it, and make it deterministic without disabling it.

  4. Write a safe schema migration

    Add a non-null column to a hot table with a backfill and a reversible migration that won't lock writes.

  5. Run a log-driven incident RCA

    From raw logs of a partial outage, reconstruct the timeline and name the root cause with evidence.

This is a sample report. The numbers above are representative, not measured. It exists to demonstrate the format. See the workflow runbook for how to produce a real one from an AgentClash race.

How the race worked

Every model got the same five tasks, the same sandbox, and the same tools. No task was a trivia question — each one is a slice of real engineering work with a verifiable outcome: a test that has to pass, a regression that has to be named, a migration that has to be reversible. AgentClash scores the whole trajectory, not just the final answer, across four vantages — deterministic checks, numeric metrics, behavioral signals, and LLM judges — and folds them into one composite verdict. (More on that in how AgentClash scores agent trajectories.)
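To make the "fold into one composite" step concrete, here is a minimal sketch of one way four vantage scores could be combined. The report does not say how AgentClash aggregates them, so the weighted mean, the `WEIGHTS` values, and the names `VantageScores` and `composite` are all assumptions made for illustration, not the actual scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VantageScores:
    """One model's scores on a task, each on a 0-100 scale (assumed scale)."""
    deterministic_checks: float
    numeric_metrics: float
    behavioral_signals: float
    llm_judges: float


# Hypothetical weights: not taken from the report. Deterministic checks are
# weighted highest here only because they are the least subjective signal.
WEIGHTS = {
    "deterministic_checks": 4,
    "numeric_metrics": 2,
    "behavioral_signals": 2,
    "llm_judges": 2,
}


def composite(scores: VantageScores, weights: dict = WEIGHTS) -> float:
    """Weighted mean of the four vantage scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(getattr(scores, name) * w for name, w in weights.items())
    return weighted_sum / total_weight


# Made-up vantage scores for a single trajectory.
example = VantageScores(
    deterministic_checks=100,
    numeric_metrics=80,
    behavioral_signals=70,
    llm_judges=90,
)
print(composite(example))
```

A weighted mean keeps the composite on the same 0-100 scale as its inputs, which makes it easy to compare against the per-dimension columns on the scoreboard.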

What we saw

Opus 4.8 won on correctness and reliability. It was the only model to land the schema migration with a reversible down-migration on the first try, and it recovered cleanly when the flaky-test sandbox surfaced an unexpected race.

The field stayed close on cost. Composite scores spanned 20 points (71 to 91), while cost per correct answer stayed within a ten-cent band ($0.11 to $0.21): the cheaper models simply needed more attempts to get there.
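The arithmetic behind that effect can be sketched as follows. The per-attempt prices and attempt counts below are invented for illustration and are not taken from the scoreboard; `cost_per_correct` is a hypothetical helper, and the report does not state how its $/correct column is computed.

```python
def cost_per_correct(cost_per_attempt: float, attempts: int, correct: int) -> float:
    """Total spend divided by the number of correct outcomes."""
    if correct <= 0:
        raise ValueError("need at least one correct outcome to compute $/correct")
    return cost_per_attempt * attempts / correct


# Hypothetical numbers: a cheap model that needs many retries versus a
# pricier model that succeeds with few.
cheap_model = cost_per_correct(cost_per_attempt=0.05, attempts=12, correct=5)
pricey_model = cost_per_correct(cost_per_attempt=0.12, attempts=6, correct=5)

print(f"cheap model:  ${cheap_model:.3f} per correct answer")
print(f"pricey model: ${pricey_model:.3f} per correct answer")
```

In this made-up case the per-attempt price differs by more than 2x, but the extra retries pull the per-correct figures much closer together.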

Latency had a different winner. On the log-driven incident RCA, where wall-clock time matters most, Gemini 3 Pro reconstructed the timeline fastest without sacrificing the root-cause evidence.

Takeaway

A single benchmark number hides the trade-off that actually decides which model you ship. Race them on your tasks, with your tools, and read the whole scoreboard — correctness, reliability, latency, and cost — before you pick.
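As a small illustration of reading the whole scoreboard, the sketch below finds the leader in each column using the sample numbers from this report. It assumes the Latency and Cost columns are scores where higher is better and that $/correct is better when lower; the names `SCOREBOARD`, `LOWER_IS_BETTER`, and `leaders` are hypothetical.

```python
# Sample (not measured) numbers copied from the scoreboard in this report.
SCOREBOARD = {
    "Claude Opus 4.8": {"composite": 91, "correctness": 95, "reliability": 93,
                        "latency": 78, "cost": 74, "usd_per_correct": 0.14},
    "GPT-5.1":         {"composite": 86, "correctness": 90, "reliability": 88,
                        "latency": 81, "cost": 71, "usd_per_correct": 0.17},
    "Gemini 3 Pro":    {"composite": 84, "correctness": 86, "reliability": 83,
                        "latency": 92, "cost": 76, "usd_per_correct": 0.15},
    "Grok 4":          {"composite": 79, "correctness": 82, "reliability": 77,
                        "latency": 80, "cost": 69, "usd_per_correct": 0.21},
    "Mistral Large 3": {"composite": 71, "correctness": 74, "reliability": 70,
                        "latency": 75, "cost": 83, "usd_per_correct": 0.11},
}

# Assumption: every column is a higher-is-better score except $/correct.
LOWER_IS_BETTER = {"usd_per_correct"}


def leaders(board: dict) -> dict:
    """Return the best model for each scoreboard column."""
    columns = next(iter(board.values())).keys()
    result = {}
    for column in columns:
        pick = min if column in LOWER_IS_BETTER else max
        result[column] = pick(board, key=lambda model: board[model][column])
    return result


for column, model in leaders(SCOREBOARD).items():
    print(f"{column:>16}: {model}")
```

Under those assumptions, three different models lead at least one column, which is the trade-off a single composite number would hide.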

