Scrub the replay. See exactly where it got stuck.
Every think, every tool call, every observation is captured. Step back to the moment a model went sideways — the prompt it saw, the output it produced, the state it worked from. No more guessing why one model won and another flunked.
Any model.
Any provider.
Normalised tool calls, normalised errors, same scoring rules. First-class adapters for the providers below, with OpenRouter covering the long tail: three hundred more models, no extra code.
- OpenAI
- Anthropic
- Gemini
- xAI
- Mistral
- OpenRouter
New first-class providers land every month.
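As a rough illustration of what one tool-call schema across providers can mean, the sketch below maps an OpenAI-style and an Anthropic-style tool call onto a single record. The `ToolCall` type and the `normalise` function are hypothetical names invented for this example, not AgentClash's adapter API; only the two input shapes follow the providers' public message formats.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """Provider-neutral tool call (hypothetical schema, for illustration only)."""
    provider: str
    call_id: str
    name: str
    arguments: dict[str, Any]


def normalise(provider: str, raw: dict[str, Any]) -> ToolCall:
    """Map a provider-specific tool call onto the shared ToolCall record."""
    if provider == "openai":
        # OpenAI chat completions carry arguments as a JSON-encoded string.
        fn = raw["function"]
        return ToolCall(provider, raw["id"], fn["name"], json.loads(fn["arguments"]))
    if provider == "anthropic":
        # Anthropic tool_use blocks carry arguments as an already-parsed object.
        return ToolCall(provider, raw["id"], raw["name"], dict(raw["input"]))
    raise ValueError(f"no adapter for provider: {provider}")


openai_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "README.md"}'},
}
anthropic_call = {
    "type": "tool_use",
    "id": "toolu_1",
    "name": "read_file",
    "input": {"path": "README.md"},
}

a = normalise("openai", openai_call)
b = normalise("anthropic", anthropic_call)
```

Once both calls share a shape, the same scoring rule can compare `a.name` and `a.arguments` against `b.name` and `b.arguments` without knowing which provider produced them.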
A fresh microVM for every agent.
Each agent boots into its own Firecracker microVM — isolated filesystem, isolated network, no shared kernel. When the run ends, the sandbox is torn down. The next one spins up clean.
That isolation isn't just safety. It's what makes the comparison fair. No model gets a warm cache. No prompt leaks between lanes. The only variable is the agent you are testing.
Powered by E2B — the sandbox infrastructure behind AI products at Perplexity, Hugging Face, and Groq.
Real tools. Real effects.
Agents work with the same primitives a developer uses — file I/O, data queries, HTTP, shell, test runners. Real commands, real sandboxed effects, not a transcript of imagined tool calls.
Compose your own. Every challenge is a single YAML file you commit next to your code — tools, policy, scoring, starting state, all declarative. No SDK to vendor, no plugin to build.
Bring your own APIs. Internal services, auth-gated endpoints, custom SDKs wrap as higher-level tools — inventory_lookup, migrate_db, whatever your domain needs. Credentials inject at call time from a scoped vault; the agent never sees them.
Fine-grained policy per pack: allowed tool kinds, shell access, network access, max calls per run. Benchmark under tight constraints, or unlock full power for development evals.
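To make the policy knobs concrete, here is a minimal sketch of a per-pack policy check. The `PackPolicy` class, its field names, and the default call budget are assumptions made for this example; they are not AgentClash's actual YAML keys or enforcement code.

```python
from dataclasses import dataclass, field


@dataclass
class PackPolicy:
    """Per-pack limits (field names are hypothetical, not AgentClash's schema)."""
    allowed_tool_kinds: set[str]
    allow_shell: bool = False
    allow_network: bool = False
    max_calls_per_run: int = 50
    calls_made: int = field(default=0, init=False)

    def authorise(self, tool_kind: str, needs_network: bool = False) -> None:
        """Raise PermissionError if the call violates the policy, else count it."""
        if self.calls_made >= self.max_calls_per_run:
            raise PermissionError("call budget exhausted")
        if tool_kind == "shell" and not self.allow_shell:
            raise PermissionError("shell access is disabled")
        if tool_kind not in self.allowed_tool_kinds:
            raise PermissionError(f"tool kind not allowed: {tool_kind}")
        if needs_network and not self.allow_network:
            raise PermissionError("network access is disabled")
        self.calls_made += 1


def is_denied(policy: PackPolicy, tool_kind: str, **kwargs) -> bool:
    """Helper: True when the policy rejects the call."""
    try:
        policy.authorise(tool_kind, **kwargs)
    except PermissionError:
        return True
    return False


# A tight benchmark policy: two tool kinds, no shell, no network, two calls.
tight = PackPolicy(allowed_tool_kinds={"file", "test_runner"}, max_calls_per_run=2)
tight.authorise("file")
tight.authorise("test_runner")
```

A third call against `tight` is rejected because the budget is spent; a development policy would simply raise the limits and switch the two flags on.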
One number is a lie.
Every run is judged from four independent vantage points, with consensus aggregation across multiple judge models. One composite verdict per eval session. Weights you control.
- Deterministic: exact, regex, JSON Schema, code execution, file-tree assertions.
- Mathematic: math equivalence, BLEU, ROUGE, ChrF, token F1, numeric tolerance.
- Behavioural: recovery, exploration, scope adherence, confidence calibration · plus latency, cost, reliability.
- LLM + aggregation: rubric, assertion, reference, pairwise · median, mean, majority-vote, unanimous consensus.
Grounded in a decade of open evaluation research. We didn't invent the primitives; we wired them together so you can run them all in one eval session.
Combine them as a weighted score, a binary pass/fail, or a hybrid with gates, tuned to the bar you'd ship against.
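One way to picture the aggregation described above is the sketch below: several judge scores collapse into one LLM score, and the four vantage scores combine into a weighted verdict with optional gates. The function names, the 0 to 1 score range, the 0.5 vote threshold, and the example weights are all assumptions for illustration, not AgentClash's scoring implementation.

```python
from statistics import mean, median
from typing import Optional


def consensus(scores: list[float], method: str = "median", threshold: float = 0.5) -> float:
    """Collapse several judge scores (each in 0..1) into one score."""
    votes = [s >= threshold for s in scores]  # a judge "passes" at or above threshold
    if method == "median":
        return median(scores)
    if method == "mean":
        return mean(scores)
    if method == "majority":
        return 1.0 if sum(votes) > len(votes) / 2 else 0.0
    if method == "unanimous":
        return 1.0 if all(votes) else 0.0
    raise ValueError(f"unknown method: {method}")


def composite(axes: dict[str, float], weights: dict[str, float],
              gates: Optional[dict[str, float]] = None) -> float:
    """Weighted mean over axes; any axis below its gate forces the verdict to 0."""
    for axis, minimum in (gates or {}).items():
        if axes[axis] < minimum:
            return 0.0
    total = sum(weights.values())
    return sum(axes[a] * w for a, w in weights.items()) / total


judge_scores = [0.9, 0.7, 0.4]  # three hypothetical judge models
axes = {
    "deterministic": 1.0,
    "mathematic": 0.8,
    "behavioural": 0.6,
    "llm": consensus(judge_scores, "median"),
}
weights = {"deterministic": 0.4, "mathematic": 0.2, "behavioural": 0.2, "llm": 0.2}

weighted = composite(axes, weights)                            # pure weighted score
hybrid = composite(axes, weights, gates={"behavioural": 0.7})  # gate fails, verdict is 0
```

In this sketch a binary verdict is just the weighted score compared against a pass mark, and the hybrid mode lets a single hard requirement veto an otherwise good average.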
Failures and evals are one loop.
When a model flunks a challenge, the failing trace is frozen into a permanent test. Every future eval replays it, next month's included.
Your eval suite sharpens itself with use. Each escaped failure becomes a regression the next agent cannot skip.
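The loop can be sketched as a small data structure: a failed run is frozen under a stable ID, and every later eval re-runs the frozen cases. `freeze_trace`, `RegressionSuite`, and the idea that a replay means re-running the challenge the trace failed on are assumptions made for this example, not a description of AgentClash internals.

```python
import hashlib
import json
from typing import Callable, Optional


def freeze_trace(challenge_id: str, trace: list[dict]) -> dict:
    """Turn a failing trace into a regression case with a content-derived ID."""
    payload = json.dumps({"challenge": challenge_id, "trace": trace}, sort_keys=True)
    case_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return {"id": case_id, "challenge": challenge_id, "trace": trace}


class RegressionSuite:
    """Hypothetical in-memory suite of frozen failures."""

    def __init__(self) -> None:
        self.cases: dict[str, dict] = {}

    def promote(self, challenge_id: str, trace: list[dict], passed: bool) -> Optional[str]:
        """Freeze the trace only if the run failed; return the case ID or None."""
        if passed:
            return None
        case = freeze_trace(challenge_id, trace)
        self.cases[case["id"]] = case  # the same failure twice maps to the same ID
        return case["id"]

    def replay_all(self, run_challenge: Callable[[str], bool]) -> dict[str, bool]:
        """Re-run every frozen case; run_challenge returns True on pass."""
        return {cid: run_challenge(c["challenge"]) for cid, c in self.cases.items()}


suite = RegressionSuite()
failing_trace = [{"step": 1, "tool": "shell", "error": "tests still red"}]
first_id = suite.promote("auth-fix", failing_trace, passed=False)
second_id = suite.promote("auth-fix", failing_trace, passed=False)
skipped = suite.promote("sql-bug", [], passed=True)
```

Deriving the ID from the trace contents keeps the suite free of duplicates when the same failure shows up in several runs.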
From challenge to scoreboard.
Set up an eval in under a minute. Inspect failures in the time it takes to finish a coffee.
01
Pick a challenge
Write your own or pull from the library. Real tasks — a broken auth server, a SQL bug, a spec to implement — not trivia.
02
Run your agents
Point AgentClash at the models or harnesses you ship. Same tool policy, same time budget, same starting state.
03
Inspect failures
Scrub the replay to the step where things broke. Scorecards show completion, cost, latency, and tool strategy.
What teams debug here.
Five task families AgentClash is built for.
- Coding
- Research
- SRE
- Multi-step ops
- Codebase Q&A
01
Coding
Two of ten tests are red in server/auth. Ship a PR that makes them green without changing the test shapes or the public types.
02
Research
Compare how three recent papers model RLHF reward hacking. Cite every claim with paper and section; we check.
03
SRE
p99 on /checkout jumped at 14:03 UTC. Localise the cause from logs, traces, and the last two deploys.
04
Multi-step ops
Customer charged twice. Refund the duplicate, not the original, then confirm the active subscription survived.
05
Codebase Q&A
Where is the rate limiter applied to /runs? Give file paths, line numbers, and the call chain. Files cited must actually exist.
…and many more.
Whatever agent you ship — debug it on AgentClash.
They test prompts.
We debug agents.
The tools below are excellent at prompt engineering — scoring text a model produces from a single call. AgentClash is built for the next problem over: evaluating agents that take actions, use tools, and run for minutes at a time in a real sandbox.
Multi-turn agent loops
Think → tool → observe → repeat, for minutes, with a fresh environment. Not one prompt → one response.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): Partial
- LangSmith (prompt eval): Partial
- Promptfoo (prompt eval): No
- Langfuse (prompt eval): Partial
- Arize Phoenix (prompt eval): Partial
- OpenAI Evals (prompt eval): Partial
Sandboxed tool execution
A fresh microVM per agent — real files, real shell, real network, real side effects.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): No
- LangSmith (prompt eval): No
- Promptfoo (prompt eval): No
- Langfuse (prompt eval): No
- Arize Phoenix (prompt eval): No
- OpenAI Evals (prompt eval): No
Same-task concurrent eval
Every model runs the same task at the same time, on the same budget. No staggered runs, no warm caches.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): No
- LangSmith (prompt eval): No
- Promptfoo (prompt eval): No
- Langfuse (prompt eval): No
- Arize Phoenix (prompt eval): No
- OpenAI Evals (prompt eval): No
Trajectory scoring
Judges the path, not just the final answer — tool-choice efficiency, recovery from error, scope discipline.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): Partial
- LangSmith (prompt eval): Partial
- Promptfoo (prompt eval): No
- Langfuse (prompt eval): Partial
- Arize Phoenix (prompt eval): Partial
- OpenAI Evals (prompt eval): No
Cross-provider tool-call normalisation
One schema across OpenAI, Anthropic, Gemini, xAI, Mistral, OpenRouter. Errors classified, retries sane.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): Partial
- LangSmith (prompt eval): Partial
- Promptfoo (prompt eval): Partial
- Langfuse (prompt eval): Partial
- Arize Phoenix (prompt eval): Partial
- OpenAI Evals (prompt eval): No
Four-vantage composite verdict
Deterministic + mathematic + behavioural + LLM, with consensus aggregation and weights you control.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): Partial
- LangSmith (prompt eval): Partial
- Promptfoo (prompt eval): Partial
- Langfuse (prompt eval): Partial
- Arize Phoenix (prompt eval): Partial
- OpenAI Evals (prompt eval): Partial
Failures auto-promote to regression
Flunked traces freeze into permanent tests and replay in every future eval, by default.
- AgentClash (agent eval): Yes
- Braintrust (prompt eval): Partial
- LangSmith (prompt eval): Partial
- Promptfoo (prompt eval): Partial
- Langfuse (prompt eval): Partial
- Arize Phoenix (prompt eval): Partial
- OpenAI Evals (prompt eval): No
Yes: supported · Partial: partially supported · No: not a core capability
We're shipping more
than you think.
Failure reports are the visible part. Under the hood sit eight capabilities most teams quietly want from an eval platform but rarely get in one place. Trust us — or better, scroll.
Artifacts
Every run is a paper trail.
Logs, output files, scorecards, diffs, agent manifests — everything an agent produced, sealed per run, addressable by ID. Inspect in the UI, stream from the API, or pipe to your own storage.
RAG testing
Retrieval and generation, judged together.
Feed your corpus. Watch what each model retrieved before it answered. Grounding, faithfulness, and citation coverage scored as first-class axes — not left as an afterthought of the answer.
Key security
The agent never sees your keys.
API keys, DB creds, OAuth tokens live in a scoped secret vault. Tools inject them into the sandbox at call time — never into the prompt, never into the trace, never into the replay. The agent uses the capability; it doesn't know the secret.
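A minimal sketch of call-time injection, assuming a simple in-memory vault: the agent is handed a callable that takes only arguments, and the secret is resolved inside the wrapper and kept out of the trace. `ScopedVault`, `wrap_tool`, and the fake inventory service are hypothetical stand-ins written for this example; `inventory_lookup` is borrowed from the custom-tool example earlier on this page.

```python
from typing import Any, Callable


class ScopedVault:
    """In-memory stand-in for a secret vault (hypothetical, for illustration)."""

    def __init__(self, secrets: dict[str, str], scopes: dict[str, set[str]]) -> None:
        self._secrets = secrets
        self._scopes = scopes  # tool name -> names of the secrets it may read

    def get(self, tool: str, secret_name: str) -> str:
        if secret_name not in self._scopes.get(tool, set()):
            raise PermissionError(f"{tool} may not read {secret_name}")
        return self._secrets[secret_name]


def wrap_tool(name: str, secret_name: str, vault: ScopedVault,
              impl: Callable[..., Any], trace: list[dict]) -> Callable[..., Any]:
    """Return the callable the agent sees; it accepts arguments, never the secret."""
    def tool(**args: Any) -> Any:
        secret = vault.get(name, secret_name)  # resolved at call time
        result = impl(secret, **args)
        # Only the tool name, arguments, and result are recorded.
        trace.append({"tool": name, "args": args, "result": result})
        return result
    return tool


def fake_inventory_api(api_key: str, sku: str) -> dict:
    """Stand-in for an auth-gated internal service."""
    if api_key != "s3cr3t":
        raise PermissionError("bad key")
    return {"sku": sku, "in_stock": True}


vault = ScopedVault({"INVENTORY_KEY": "s3cr3t"}, {"inventory_lookup": {"INVENTORY_KEY"}})
trace: list[dict] = []
inventory_lookup = wrap_tool("inventory_lookup", "INVENTORY_KEY", vault,
                             fake_inventory_api, trace)
result = inventory_lookup(sku="A-100")
```

The agent can call `inventory_lookup(sku=...)` as often as policy allows, yet nothing it can read (arguments, result, or trace) contains the key.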
Tracing
Tracing like never before.
OpenTelemetry-native. Every think, every tool call, every observation, every byte — with span trees, causal chains, per-step cost and latency. Not a transcript dump. A forensic record.
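To show what per-step cost and latency on a span tree buys you, the sketch below rolls both up from leaf spans to the run. It is plain Python, not the OpenTelemetry SDK; the `Span` class, its fields, and the assumption that `latency_ms` is each span's own time (excluding children) are illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a run's span tree (hypothetical shape, for illustration)."""
    name: str
    cost_usd: float = 0.0
    latency_ms: float = 0.0  # assumed to be self time, excluding children
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus everything beneath it."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def path_latency(self) -> float:
        """Self time plus the slowest chain of descendants."""
        slowest_child = max((c.path_latency() for c in self.children), default=0.0)
        return self.latency_ms + slowest_child

    def slowest_path(self) -> list[str]:
        """Span names along the chain that contributes the most latency."""
        if not self.children:
            return [self.name]
        worst = max(self.children, key=lambda c: c.path_latency())
        return [self.name] + worst.slowest_path()


run = Span("run", children=[
    Span("think", cost_usd=0.002, latency_ms=800),
    Span("tool:shell", latency_ms=1200, children=[
        Span("observe", cost_usd=0.001, latency_ms=300),
    ]),
])
```

Here `run.slowest_path()` points at the shell call and its observation rather than the model's thinking step, which is the kind of question a flat transcript cannot answer directly.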
Knowledge sources
Your docs, wired in.
Attach PDFs, wikis, Notion, codebases, your own APIs. Agents query them through a shared retriever with provenance on every fact — so when a model cites something, you can see exactly where it came from.
Regression suites
Every failure becomes a test.
When a model flunks, the failing trace freezes into a permanent regression. Every future eval replays it. The suite sharpens itself — each escaped failure becomes a test the next agent cannot skip.
Comparison
Diff two runs, side by side.
Same challenge, new model, or same model with a new prompt. See exactly what moved: completion, cost, latency, tool trajectory, scorecard axes. No guessing which upgrade mattered.
CI/CD
Gate the merge on the eval.
Trigger evals from GitHub Actions, a webhook, or the CLI. Fail the build when your agent regresses on the scorecard you care about. Eval moves from a dashboard you visit to a check that blocks bad code.
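A gate of this kind can be sketched in a few lines: compare a candidate scorecard with a baseline and return a non-zero exit code when any axis drops. The axis names, the 0.02 tolerance, and the assumption that every axis is "higher is better" are illustrative; this is not the AgentClash CLI or its release-gate format.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Axes where the candidate is worse than the baseline by more than tolerance."""
    return sorted(
        axis for axis, base in baseline.items()
        if candidate.get(axis, 0.0) < base - tolerance  # a missing axis counts as 0
    )


def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> int:
    """Exit code for CI: 0 when nothing regressed, 1 otherwise."""
    failed = regressions(baseline, candidate, tolerance)
    for axis in failed:
        print(f"REGRESSION {axis}: {baseline[axis]:.2f} -> {candidate.get(axis, 0.0):.2f}")
    return 1 if failed else 0


baseline = {"completion": 0.90, "scope_adherence": 0.80}
candidate = {"completion": 0.91, "scope_adherence": 0.70}
exit_code = gate(baseline, candidate)
```

In a pipeline step the script would end with `sys.exit(exit_code)`, so a drop in any gated axis fails the build.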
Want something that isn't here? Open an issue. We read every one.
Run real evals for free.
Start on hosted Free or self-host the engine. Upgrade only when you need more runs, retention, or governance.
Free
Run real evals first. Upgrade only when you need more runs, retention, or team controls.
- 1 workspace
- 25 eval runs / month
- Up to 4 models per run
- 7-day replay retention
- BYOK LLM keys
- BYOK sandbox (E2B token)
- Community support
Pro
For teams moving from evaluation to repeated release checks.
Start on Free, pay when you need more
- Everything in Free, plus:
- 500 eval runs / workspace / month
- Up to 8 models per run
- 30-day replay retention
- Hosted sandbox with included credit
- Private challenge packs
- CI integration (GitHub Actions, webhooks)
- 3 concurrent eval runs
- Email support, < 1 business day
Team
For teams running evals across multiple products and surfaces.
For higher run volume and governance
- Everything in Pro, plus:
- 2,000 eval runs / workspace / month
- Up to 12 models per run
- 90-day replay retention
- 10 concurrent eval runs
- Multiple workspaces
- Workspace-level audit log
- Slack notifications
- Priority email support, < 4 business hours
Enterprise
Compliance, SSO, dedicated support, and paid rollout help.
- Everything in Team, plus:
- SSO / SAML
- Org-wide audit logs
- Unlimited replay retention
- 99.9% uptime SLA
- Dedicated support channel
- Custom MSA / billing terms
BYOK on every tier — we never mark up tokens. Eval quotas pool at the workspace level.
Special thanks
AgentClash exists because of Y Combinator Startup School and the E2B Startup Program. Without them, we could not have built it.
Stop guessing.
Start evaluating.
Book an eval workshop to baseline your agents on real workloads — or run your first eval in the hosted product when your team is ready to self-serve.
An eval engine you can't audit isn't an eval engine.
Open source. Read the code, fork it, self-host it.
Questions, answered.
What is AgentClash?
AgentClash is an open-source AI agent evaluation platform. It runs your agents on real tasks with the same tools and constraints, captures replayable failure evidence, scores the full trajectory, and lets you promote failed runs into permanent regression tests.
How is AgentClash different from prompt-evaluation tools like LangSmith or Braintrust?
Prompt-evaluation tools score the text a model returns from a single call. AgentClash evaluates multi-turn agents that take actions in a real sandbox and scores the whole trajectory — tool choices, cost, latency, and recovery — not just the final answer. See agentclash.dev/compare for a side-by-side.
Can I run AgentClash in CI?
Yes. AgentClash compares a candidate run against a baseline and fails CI when the candidate regresses on the scorecard or release gate you define. Failed runs can be promoted into permanent regression tests that replay in every future eval.
Is AgentClash open source, and can I self-host it?
Yes. AgentClash is open source under the MIT license. You can self-host the full stack or run against the hosted backend, and the CLI installs from npm as the agentclash package.
Which models and providers does AgentClash support?
First-class adapters for OpenAI, Anthropic, Gemini, xAI, and Mistral, plus 300+ more models via OpenRouter. Tool calls are normalised to a single schema across providers so evals stay comparable.
How do I get started with AgentClash?
Install the CLI with npm install -g agentclash (or run against the hosted backend), then follow the quickstart to author a challenge pack and run your first agent eval.