Open-source AI agent evaluation.
Real tasks. Not vibes.

Race agents head-to-head with the same tools, same constraints, and same time budget. Replay every tool call, score the trajectory, and fail CI when a candidate regresses.

Start first raceGitHub

Scrub the replay. See exactly where it got stuck.

Every think, every tool call, every observation is captured. Step back to the moment a model went sideways — the prompt it saw, the output it produced, the state it worked from. No more guessing why one model won and another flunked.

Any model.
Any provider.

Normalised tool-calls, normalised errors, same scoring rules. First-class adapters for the providers below, plus OpenRouter for the long tail — three hundred more models, no extra code.

OpenAI
Anthropic
Gemini
xAI
Mistral
OpenRouter

Plus 300 more via OpenRouter. New first-class providers landing every month.

A fresh microVM for every agent.

Each racer boots into its own Firecracker microVM — isolated filesystem, isolated network, no shared kernel. When the race ends, the sandbox is torn down. The next one spins up clean.

That isolation isn't just safety. It's what makes the race fair. No model gets a warm cache. No prompt leaks between lanes. The only variable in the race is the model.

Real tools. Real effects.

Agents race with the same primitives a developer uses — file I/O, data queries, HTTP, shell, test runners. Real commands, real sandboxed effects, not a transcript of imagined tool calls.

Compose your own. Every challenge is a single YAML file you commit next to your code — tools, policy, scoring, starting state, all declarative. No SDK to vendor, no plugin to build.

Bring your own APIs. Internal services, auth-gated endpoints, custom SDKs wrap as higher-level tools — inventory_lookup, migrate_db, whatever your domain needs. Credentials inject at call time from a scoped vault; the agent never sees them.

Fine-grained policy per pack: allowed tool kinds, shell access, network access, max calls per run. Benchmark under tight constraints, or unlock full-power for dev races.

One number is a lie.

Every run is judged from four independent vantage points, with consensus aggregation across multiple judge models. One composite verdict per eval session. Weights you control.

Deterministic: exact, regex, JSON Schema, code execution, file-tree assertions.
Mathematic: math equivalence, BLEU, ROUGE, ChrF, token F1, numeric tolerance.
Behavioural: recovery, exploration, scope adherence, confidence calibration · plus latency, cost, reliability.
LLM + aggregation: rubric, assertion, reference, pairwise · median, mean, majority-vote, unanimous consensus.

Grounded in a decade of open evaluation research. We didn't invent the primitives; we wired them together so you can run them all in one eval session.

Combined weighted, binary, or hybrid-with-gates — tuned to the bar you'd ship against.

Failures and evals are one loop.

When a model flunks a challenge, the failing trace is frozen into a permanent test. Next week's race replays it. The following month's does too.

Your eval suite sharpens itself with use. By the time a new model arrives, it walks into a track that was paved by every mistake the last model made.

From challenge to scoreboard.

Set up a head-to-head race in under a minute. Watch a verdict arrive in the time it takes to finish a coffee.

01
Pick a challenge
Write your own or pull from the library. Real tasks — a broken auth server, a SQL bug, a spec to implement — not trivia.
02
Pick your models
Line up six or eight contestants across providers. Same tool policy, same time budget, same starting state.
03
Watch them race
Live scoring as they work. Composite metric across completion, speed, token efficiency, and tool strategy.

What teams race here.

Five task families AgentClash is built for. Hover any card to read the brief.

Two of ten tests are red in server/auth. Ship a PR that makes them green without changing the test shapes or the public types.

Coding

Two of ten tests are red in server/auth. Ship a PR that makes them green without changing the test shapes or the public types.

Coding

Compare how three recent papers model RLHF reward hacking. Cite every claim with paper and section — we check.

p99 on /checkout jumped at 14:03 UTC. Localise the cause from logs, traces, and the last two deploys.

Customer charged twice. Refund the duplicate, not the original, then confirm the active subscription survived.

Where is the rate limiter applied to /runs? Give file paths, line numbers, and the call chain. Files cited must actually exist.

…and many more.

Whatever agent you can dream up — race it on AgentClash.

They test prompts.
We race agents.

The tools below are excellent at prompt engineering — scoring text a model produces from a single call. AgentClash is built for the next problem over: evaluating agents that take actions, use tools, and run for minutes at a time in a real sandbox.

Multi-turn agent loops

Think → tool → observe → repeat, for minutes, with a fresh environment. Not one prompt → one response.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Sandboxed tool execution

A fresh microVM per agent — real files, real shell, real network, real side effects.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Head-to-head concurrent race

Every model runs the same task at the same time, on the same budget. No staggered runs, no warm caches.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Trajectory scoring

Judges the path, not just the final answer — tool-choice efficiency, recovery from error, scope discipline.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Cross-provider tool-call normalisation

One schema across OpenAI, Anthropic, Gemini, xAI, Mistral, OpenRouter. Errors classified, retries sane.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Four-vantage composite verdict

Deterministic + mathematic + behavioural + LLM, with consensus aggregation and weights you control.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Failures auto-promote to regression

Flunked traces freeze into permanent tests and replay in every future race, by default.

AgentClash
Braintrust
LangSmith
Promptfoo
Langfuse
Arize Phoenix
OpenAI Evals

Capability

AgentClashagent eval

Braintrustprompt eval

LangSmithprompt eval

Promptfooprompt eval

Langfuseprompt eval

Arize Phoenixprompt eval

OpenAI Evalsprompt eval

Multi-turn agent loops

Think → tool → observe → repeat, for minutes, with a fresh environment. Not one prompt → one response.

Sandboxed tool execution

A fresh microVM per agent — real files, real shell, real network, real side effects.

Head-to-head concurrent race

Every model runs the same task at the same time, on the same budget. No staggered runs, no warm caches.

Trajectory scoring

Judges the path, not just the final answer — tool-choice efficiency, recovery from error, scope discipline.

Cross-provider tool-call normalisation

One schema across OpenAI, Anthropic, Gemini, xAI, Mistral, OpenRouter. Errors classified, retries sane.

Four-vantage composite verdict

Deterministic + mathematic + behavioural + LLM, with consensus aggregation and weights you control.

Failures auto-promote to regression

Flunked traces freeze into permanent tests and replay in every future race, by default.

● supported · ◐ partial · — not a core capability

We're shipping more
than you think.

The race engine is the visible part. Under the hood sit eight capabilities most teams quietly want from an eval platform but rarely get in one place. Trust us — or better, scroll.

Artifacts
Every run is a paper trail.
Logs, output files, scorecards, diffs, agent manifests — everything an agent produced, sealed per run, addressable by ID. Inspect in the UI, stream from the API, or pipe to your own storage.
RAG testing
Retrieval and generation, judged together.
Feed your corpus. Watch what each model retrieved before it answered. Grounding, faithfulness, and citation coverage scored as first-class axes — not left as an afterthought of the answer.
Key security
The agent never sees your keys.
API keys, DB creds, OAuth tokens live in a scoped secret vault. Tools inject them into the sandbox at call time — never into the prompt, never into the trace, never into the replay. The agent uses the capability; it doesn't know the secret.
Tracing
Tracing like never before.
OpenTelemetry-native. Every think, every tool call, every observation, every byte — with span trees, causal chains, per-step cost and latency. Not a transcript dump. A forensic record.
Knowledge sources
Your docs, wired in.
Attach PDFs, wikis, Notion, codebases, your own APIs. Agents query them through a shared retriever with provenance on every fact — so when a model cites something, you can see exactly where it came from.
Regression suites
Every failure becomes a test.
When a model flunks, the failing trace freezes into a permanent regression. Next week's race replays it. The suite sharpens itself — by the time a new model arrives, it walks into a track paved by every mistake the last one made.
Comparison
Diff two races, side by side.
Same challenge, new model, or same model with a new prompt. See exactly what moved: completion, cost, latency, tool trajectory, scorecard axes. No guessing which upgrade mattered.
CI/CD
Gate the merge on the race.
Trigger races from GitHub Actions, a webhook, or the CLI. Fail the build when your agent regresses on the scorecard you care about. Eval moves from a dashboard you visit to a check that blocks bad code.

Want something that isn't here? Open an issue. We read every one.

Free for 45 days.

No credit card. Self-host the engine for free, or skip the ops with hosted.

Free

Hosted, no ops. Generous enough to actually evaluate the product on your task.

$0/ month

Start your first race

1 seat, 1 workspace
25 races / month
Up to 4 models per race
7-day replay retention
BYOK LLM keys
BYOK sandbox (E2B token)
Community support

Pro

For teams running real evals against real production tasks. Five seats minimum.

$49/ seat / month

Billed monthly

Start free 45-day trial

No credit card required

Everything in Free, plus:
500 races / seat / month
Up to 8 models per race
30-day replay retention
Hosted sandbox with included credit
Private challenge packs
CI integration (GitHub Actions, webhooks)
3 concurrent races
Email support, < 1 business day

Team

For teams running evals across multiple products and surfaces.

$100/ seat / month

Billed monthly

Start free 45-day trial

No credit card required

Everything in Pro, plus:
2,000 races / seat / month
Up to 12 models per race
90-day replay retention
10 concurrent races
Multiple workspaces
Workspace-level audit log
Slack notifications
Priority email support, < 4 business hours

Enterprise

Compliance, SSO, dedicated support. 45-day pilot available — no card needed.

Custom

Talk to us

Everything in Team, plus:
SSO / SAML
Org-wide audit logs
Unlimited replay retention
99.9% uptime SLA
Dedicated support channel
Custom MSA / billing terms

BYOK on every tier — we never mark up tokens. Race quota pools at the workspace level.

Special thanks

AgentClash exists because of Y Combinator Startup School and the E2B Startup Program. If it wasn't for them, we wouldn't have been able to do this.

YC Startup School E2B Startup Program

Stop guessing.
Start racing.

Start your first raceStar on GitHub

An eval engine you can't audit isn't an eval engine.

Open source. Read the code, fork it, self-host it.

Open-source AI agent evaluation.Real tasks. Not vibes.

Scrub the replay. See exactly where it got stuck.

Any model.Any provider.

A fresh microVM for every agent.

Real tools. Real effects.

One number is a lie.

Failures and evals are one loop.

From challenge to scoreboard.

Pick a challenge

Pick your models

Watch them race

What teams race here.

They test prompts.We race agents.

We're shipping morethan you think.

Every run is a paper trail.

Retrieval and generation, judged together.

The agent never sees your keys.

Tracing like never before.

Your docs, wired in.

Every failure becomes a test.

Diff two races, side by side.

Gate the merge on the race.