AgentClash platform
AI agent evaluation platform for real tasks
Race agents against each other on the same workload, with the same tools and the same constraints. AgentClash captures replay evidence, scorecards, artifacts, and release-gate verdicts, so evals turn into decisions instead of dashboard archaeology.
Demo: Codex, Claude, and Gemini race the same task, each with a full replay timeline. A single command produces the CI verdict:

agentclash ci run --manifest .agentclash/ci.yaml

Correctness +7, latency -11%, cost unchanged. No promoted failures exceeded the configured threshold.
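For orientation, the manifest behind that command could look roughly like the sketch below. Every field name here is an assumption for illustration, not AgentClash's actual schema.

# .agentclash/ci.yaml (hypothetical shape; all fields assumed)
pack: packs/invoice-triage          # challenge pack to run
baseline: runs/main-latest          # baseline run to compare against
candidates:
  - codex
  - claude
  - gemini
gates:
  correctness: ">= baseline"        # fail on any correctness drop
  latency_p95: "<= baseline * 1.10" # allow up to 10% latency regression
  cost: "<= baseline"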
What the platform evaluates
The whole agent, not just the final answer
AgentClash looks at the trajectory that produced the answer: tool choices, runtime behavior, artifacts, cost, latency, and whether the agent actually satisfied the task.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy (sketched after this list)
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
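For illustration, a per-agent scorecard from a head-to-head run might look like this; the metrics and field names are assumptions, not the product's published schema.

agent: claude
task: invoice-triage
correctness: 0.92          # fraction of scoring rules satisfied
latency_p95_s: 41.3        # 95th-percentile task latency, seconds
cost_usd: 0.18
tool_strategy:
  calls: 14
  wasted_calls: 2          # tool calls whose output was never used
replay: runs/claude/7f3c   # pointer to the replay trail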
Workflow
From one failed task to a reusable gate
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
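As a sketch only, a challenge pack might be described like this; the file layout and field names are assumed for illustration.

# packs/invoice-triage/pack.yaml (hypothetical layout)
task: Extract line items from the attached invoices into a CSV
inputs:
  - fixtures/invoices/*.pdf
tools:
  - file_read
  - python
scoring:
  - rule: output CSV parses
  - rule: totals match the ground-truth ledger
artifacts:
  - out/line_items.csv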
Race the agents
Run every candidate against the same task at the same time with the same constraints.
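Locally, that race could be kicked off with a one-liner along these lines; the subcommand and flags are assumptions, not documented CLI.

agentclash race --pack packs/invoice-triage --agents codex,claude,gemini --time-budget 10m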
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
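Opening the evidence for a single run might look like this; the command shape and run id are hypothetical.

agentclash replay runs/claude/7f3c   # step through tool calls, outputs, artifacts, latency, cost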
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
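Wired into a pull request, the gate can be a single CI step. This GitHub Actions fragment is illustrative only, assuming the command exits nonzero when a configured gate fails, which is what blocks the merge.

# .github/workflows/agent-gate.yml (illustrative fragment)
- name: Gate agent release
  run: agentclash ci run --manifest .agentclash/ci.yaml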
Start with docs
Bring your first workload into the loop
Use challenge packs for repeatable tasks, then wire the same workload into local runs, hosted runs, or pull request gates.
FAQ
Questions teams ask before replacing benchmark-only evals
What is an AI agent evaluation platform?
An AI agent evaluation platform runs agents against repeatable tasks, captures what they did, scores outcomes, and helps teams decide whether an agent or model is ready to ship.
How is AgentClash different from static benchmarks?
AgentClash runs your agents on your tasks with the same tools, constraints, and time budget, then records replay evidence and scorecards that can become regression gates.
Can AgentClash run in CI?
Yes. AgentClash can compare a candidate run against a baseline and fail CI when the candidate regresses on the scorecard or release gate you define.