Open-source AI agent evaluation.
Real tasks. Not vibes.

Race agents head-to-head with the same tools, same constraints, and same time budget. Replay every tool call, score the trajectory, and fail CI when a candidate regresses.

Scrub the replay. See exactly where it got stuck.

Every think, every tool call, every observation is captured. Step back to the moment a model went sideways — the prompt it saw, the output it produced, the state it worked from. No more guessing why one model won and another flunked.

Any model.
Any provider.

Normalised tool-calls, normalised errors, same scoring rules. First-class adapters for the providers below, plus OpenRouter for the long tail — three hundred more models, no extra code.

  • OpenAI
  • Anthropic
  • Gemini
  • Grok (xAI)
  • Mistral
  • OpenRouter

Plus 300 more via OpenRouter. New first-class providers landing every month.
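
For a flavour of what that normalisation buys you, here is a rough sketch of a single normalised tool-call record. The field names are illustrative assumptions, not AgentClash's actual schema:

tool_call:
  provider: anthropic                # same record shape for every provider
  model: claude-sonnet-4
  tool: shell
  arguments: { cmd: "npm test" }
  status: error
  error_class: timeout               # errors classified into one shared taxonomy
  latency_ms: 4180
  cost_usd: 0.0041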

A fresh microVM for every agent.

Each racer boots into its own Firecracker microVM — isolated filesystem, isolated network, no shared kernel. When the race ends, the sandbox is torn down. The next one spins up clean.

That isolation isn't just safety. It's what makes the race fair. No model gets a warm cache. No prompt leaks between lanes. The only variable in the race is the model.

Powered by E2B — the sandbox infrastructure behind AI products at Perplexity, Hugging Face, and Groq.

Real tools. Real effects.

Agents race with the same primitives a developer uses — file I/O, data queries, HTTP, shell, test runners. Real commands, real sandboxed effects, not a transcript of imagined tool calls.

Compose your own. Every challenge is a single YAML file you commit next to your code — tools, policy, scoring, starting state, all declarative. No SDK to vendor, no plugin to build.

Bring your own APIs. Internal services, auth-gated endpoints, and custom SDKs wrap as higher-level tools — inventory_lookup, migrate_db, whatever your domain needs. Credentials are injected at call time from a scoped vault; the agent never sees them.

Fine-grained policy per pack: allowed tool kinds, shell access, network access, max calls per run. Benchmark under tight constraints, or unlock full power for dev races.
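
To make that concrete, here is a minimal sketch of what such a challenge pack could look like. The field names and the inventory_lookup tool are illustrative assumptions, not the exact AgentClash schema:

# challenge.yaml: illustrative sketch only, not the real schema
name: fix-auth-tests
workspace: ./fixtures/server           # starting state copied into the sandbox
tools:
  - kind: shell                        # built-in primitives
  - kind: http
  - kind: custom
    name: inventory_lookup             # hypothetical wrapper around an internal API
    endpoint: https://inventory.internal.example/lookup
    auth:
      secret: INVENTORY_TOKEN          # injected from the vault at call time, never shown to the agent
policy:
  shell: true
  network: allow-listed
  max_tool_calls: 60
scoring:
  deterministic:
    - command: npm test                # green tests gate the verdict
time_budget: 10m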

One number is a lie.

Every run is judged from four independent vantage points, with consensus aggregation across multiple judge models. One composite verdict per eval session. Weights you control.

  • Deterministic: exact, regex, JSON Schema, code execution, file-tree assertions.
  • Mathematic: math equivalence, BLEU, ROUGE, ChrF, token F1, numeric tolerance.
  • Behavioural: recovery, exploration, scope adherence, confidence calibration · plus latency, cost, reliability.
  • LLM + aggregation: rubric, assertion, reference, pairwise · median, mean, majority-vote, unanimous consensus.

Grounded in a decade of open evaluation research. We didn't invent the primitives; we wired them together so you can run them all in one eval session.

Combine them weighted, binary, or hybrid-with-gates — tuned to the bar you'd ship against.
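
For illustration only, a hybrid-with-gates verdict could be declared along these lines; the field names and judge models are assumptions, not the product's schema:

verdict:
  mode: hybrid                          # weighted composite, with hard gates
  gates:
    - deterministic.tests_pass          # a red test zeroes the verdict outright
  weights:
    deterministic: 0.40
    mathematic: 0.10
    behavioural: 0.25
    llm: 0.25
  llm:
    judges: [gpt-4o, claude-sonnet-4, gemini-2.5-pro]   # hypothetical judge pool
    aggregation: median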

Failures and evals are one loop.

When a model flunks a challenge, the failing trace is frozen into a permanent test. Next week's race replays it. The following month's does too.

Your eval suite sharpens itself with use. By the time a new model arrives, it walks into a track that was paved by every mistake the last model made.

From challenge to scoreboard.

Set up a head-to-head race in under a minute. Watch a verdict arrive in the time it takes to finish a coffee.

  1. Pick a challenge

     Write your own or pull from the library. Real tasks — a broken auth server, a SQL bug, a spec to implement — not trivia.

  2. Pick your models

     Line up six or eight contestants across providers. Same tool policy, same time budget, same starting state.

  3. Watch them race

     Live scoring as they work. Composite metric across completion, speed, token efficiency, and tool strategy.
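
Concretely, the whole matchup could fit in one short file. This is an illustrative sketch, assuming a race config of roughly this shape rather than the exact format:

race:
  challenge: challenges/fix-auth-tests.yaml
  models:                               # every lane gets the same tools, budget, and starting state
    - openai/gpt-4o
    - anthropic/claude-sonnet-4
    - google/gemini-2.5-pro
    - mistralai/mistral-large
  time_budget: 10m
  concurrency: all                      # all lanes race at once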

What teams race here.

Five task families AgentClash is built for.

01 · Coding
Two of ten tests are red in server/auth. Ship a PR that makes them green without changing the test shapes or the public types.

02
Compare how three recent papers model RLHF reward hacking. Cite every claim with paper and section — we check.

03
p99 on /checkout jumped at 14:03 UTC. Localise the cause from logs, traces, and the last two deploys.

04
Customer charged twice. Refund the duplicate, not the original, then confirm the active subscription survived.

05
Where is the rate limiter applied to /runs? Give file paths, line numbers, and the call chain. Files cited must actually exist.

…and many more.

Whatever agent you can dream up — race it on AgentClash.

They test prompts.
We race agents.

The tools below are excellent at prompt engineering — scoring text a model produces from a single call. AgentClash is built for the next problem over: evaluating agents that take actions, use tools, and run for minutes at a time in a real sandbox.

Compared against Braintrust, LangSmith, Promptfoo, Langfuse, Arize Phoenix, and OpenAI Evals (all prompt-eval tools), AgentClash is the agent-eval entry on every capability below.

Multi-turn agent loops
Think → tool → observe → repeat, for minutes, with a fresh environment. Not one prompt → one response.

Sandboxed tool execution
A fresh microVM per agent — real files, real shell, real network, real side effects.

Head-to-head concurrent race
Every model runs the same task at the same time, on the same budget. No staggered runs, no warm caches.

Trajectory scoring
Judges the path, not just the final answer — tool-choice efficiency, recovery from error, scope discipline.

Cross-provider tool-call normalisation
One schema across OpenAI, Anthropic, Gemini, xAI, Mistral, OpenRouter. Errors classified, retries sane.

Four-vantage composite verdict
Deterministic + mathematic + behavioural + LLM, with consensus aggregation and weights you control.

Failures auto-promote to regression
Flunked traces freeze into permanent tests and replay in every future race, by default.

We're shipping more
than you think.

The race engine is the visible part. Under the hood sit eight capabilities most teams quietly want from an eval platform but rarely get in one place. Trust us — or better, scroll.

  • Artifacts

    Every run is a paper trail.

    Logs, output files, scorecards, diffs, agent manifests — everything an agent produced, sealed per run, addressable by ID. Inspect in the UI, stream from the API, or pipe to your own storage.

  • RAG testing

    Retrieval and generation, judged together.

    Feed your corpus. Watch what each model retrieved before it answered. Grounding, faithfulness, and citation coverage scored as first-class axes — not left as an afterthought of the answer.

  • Key security

    The agent never sees your keys.

    API keys, DB creds, OAuth tokens live in a scoped secret vault. Tools inject them into the sandbox at call time — never into the prompt, never into the trace, never into the replay. The agent uses the capability; it doesn't know the secret.

  • Tracing

    Tracing like never before.

    OpenTelemetry-native. Every think, every tool call, every observation, every byte — with span trees, causal chains, per-step cost and latency. Not a transcript dump. A forensic record.

  • Knowledge sources

    Your docs, wired in.

    Attach PDFs, wikis, Notion, codebases, your own APIs. Agents query them through a shared retriever with provenance on every fact — so when a model cites something, you can see exactly where it came from.

  • Regression suites

    Every failure becomes a test.

    When a model flunks, the failing trace freezes into a permanent regression. Next week's race replays it. The suite sharpens itself — by the time a new model arrives, it walks into a track paved by every mistake the last one made.

  • Comparison

    Diff two races, side by side.

    Same challenge, new model, or same model with a new prompt. See exactly what moved: completion, cost, latency, tool trajectory, scorecard axes. No guessing which upgrade mattered.

  • CI/CD

    Gate the merge on the race.

    Trigger races from GitHub Actions, a webhook, or the CLI. Fail the build when your agent regresses on the scorecard you care about. Eval moves from a dashboard you visit to a check that blocks bad code.
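
A minimal GitHub Actions sketch for that gate might look like the following. The agentclash command, its flags, and the 0.80 threshold are hypothetical stand-ins, not the real CLI:

# .github/workflows/agent-eval.yml
name: agent-eval
on: [pull_request]
jobs:
  race:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx agentclash race challenges/fix-auth-tests.yaml --fail-below 0.80   # hypothetical CLI
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}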

Want something that isn't here? Open an issue. We read every one.

Free for 45 days.

No credit card. Self-host the engine for free, or skip the ops with hosted.

Free

Hosted, no ops. Generous enough to actually evaluate the product on your task.

$0 / month
  • 1 seat, 1 workspace
  • 25 races / month
  • Up to 4 models per race
  • 7-day replay retention
  • BYOK LLM keys
  • BYOK sandbox (E2B token)
  • Community support

Pro

For teams running real evals against real production tasks. Five seats minimum.

$49 / seat / month
Billed monthly
Start free 45-day trial

No credit card required

  • Everything in Free, plus:
  • 500 races / seat / month
  • Up to 8 models per race
  • 30-day replay retention
  • Hosted sandbox with included credit
  • Private challenge packs
  • CI integration (GitHub Actions, webhooks)
  • 3 concurrent races
  • Email support, < 1 business day

Team

For teams running evals across multiple products and surfaces.

$100 / seat / month
Billed monthly
Start free 45-day trial

No credit card required

  • Everything in Pro, plus:
  • 2,000 races / seat / month
  • Up to 12 models per race
  • 90-day replay retention
  • 10 concurrent races
  • Multiple workspaces
  • Workspace-level audit log
  • Slack notifications
  • Priority email support, < 4 business hours

Enterprise

Compliance, SSO, dedicated support. 45-day pilot available — no card needed.

Custom
  • Everything in Team, plus:
  • SSO / SAML
  • Org-wide audit logs
  • Unlimited replay retention
  • 99.9% uptime SLA
  • Dedicated support channel
  • Custom MSA / billing terms

BYOK on every tier — we never mark up tokens. Race quota pools at the workspace level.

Special thanks

AgentClash exists because of Y Combinator Startup School and the E2B Startup Program. If it weren't for them, we wouldn't have been able to do this.

Stop guessing.
Start racing.

An eval engine you can't audit isn't an eval engine.

Open source. Read the code, fork it, self-host it.
