Open source
Open source AI agent evaluation for real tasks
AgentClash is MIT-licensed and self-hostable. Run head-to-head agent races on your workloads, keep replay evidence, and turn failures into reusable regression gates without a black-box vendor loop.
Illustration: candidate, baseline, and control lanes with a replay timeline and a CI verdict reading "Correctness improved, latency within budget, and required artifacts were preserved for review."
agentclash run create --follow
What open-source eval should include
Built for reviewable agent decisions
Open source matters when your eval stack needs to run inside your repo, your CI, and your sandbox policy, not just inside someone else's hosted UI.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
Workflow
From local race to team-wide gate
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
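The last step can be sketched in a few lines of Python. This is a minimal illustration of the baseline-versus-candidate idea, not AgentClash's actual gate implementation: the `Scorecard` fields, the `gate` and `exit_code` helpers, and the 1.2x latency and cost budgets are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-run summary; field names are illustrative only."""
    correctness: float  # fraction of checks passed, 0.0 to 1.0
    latency_s: float    # wall-clock seconds for the run
    cost_usd: float     # total spend for the run


def gate(baseline: Scorecard, candidate: Scorecard,
         max_latency_ratio: float = 1.2,
         max_cost_ratio: float = 1.2) -> list[str]:
    """Return the reasons the candidate regressed; an empty list means pass."""
    failures = []
    if candidate.correctness < baseline.correctness:
        failures.append("correctness regressed")
    if candidate.latency_s > baseline.latency_s * max_latency_ratio:
        failures.append("latency over budget")
    if candidate.cost_usd > baseline.cost_usd * max_cost_ratio:
        failures.append("cost over budget")
    return failures


def exit_code(failures: list[str]) -> int:
    """Map gate results to a process exit code: non-zero fails a CI job."""
    return 1 if failures else 0


# Example: the candidate is more correct and stays inside both budgets.
baseline = Scorecard(correctness=0.80, latency_s=40.0, cost_usd=0.50)
candidate = Scorecard(correctness=0.85, latency_s=44.0, cost_usd=0.55)
print(gate(baseline, candidate))  # prints []
```

In a CI job, a script like this would typically end with `sys.exit(exit_code(failures))`, so that any regression reason blocks the pull request. Returning the reasons as a list, rather than a bare boolean, keeps the verdict reviewable alongside the replay evidence.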
Start with the repo
Bring your first workload into the loop
Clone AgentClash, boot the local stack, and wire the same challenge packs into hosted runs or pull request gates.
FAQ
Questions OSS teams ask before adopting
Is AgentClash actually open source?
Yes. AgentClash is MIT-licensed. You can self-host the API server, worker, and web app, or use the hosted product when that is faster for your team.
Can we run evals entirely on our own infrastructure?
Yes. AgentClash supports local stacks, self-hosted deployments, and sandbox providers you control. Challenge packs and scorecards work the same whether the run is local or hosted.
How does open source help with agent regression testing?
You can inspect scoring rules, pack definitions, replay artifacts, and CI gate manifests in git. That makes agent evaluation auditable rather than locked inside a dashboard-only black box.