Agent evals
Agent evals that cover the whole trajectory
Agent evals should answer one question: can the agent finish the job again with the same tools, budget, and constraints? AgentClash records replay evidence and scorecards so eval results can back release decisions.
[Illustration: Candidate, Baseline, and Control runs on a replay timeline, with a CI verdict: "Correctness improved, latency within budget, and required artifacts were preserved for review."]
agentclash run create --follow
What good agent evals capture
Built for reviewable agent decisions
If your agent eval checks only the final string, it misses the tool misuse, runaway loops, and artifact gaps that show up in production.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
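To make the difference from a final-string check concrete, here is a minimal Python sketch of trajectory-level checks for tool misuse, runaway loops, and artifact gaps. The `Step` and `Trajectory` shapes, the `check_trajectory` helper, and the sample run are all hypothetical and chosen for illustration; they are not AgentClash's schema or API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded action in a run (hypothetical shape, not AgentClash's schema)."""
    tool: str
    args: str


@dataclass
class Trajectory:
    """A recorded run: the steps taken, the final answer, and the files produced."""
    steps: list[Step]
    final_output: str
    artifacts: set[str] = field(default_factory=set)


def check_trajectory(traj, expected_output, allowed_tools, required_artifacts, max_repeats=3):
    """Return a list of findings; an empty list means the run passed every check."""
    findings = []
    # The check a final-string eval already does.
    if traj.final_output != expected_output:
        findings.append("wrong final output")
    # Tool misuse: any call to a tool outside the allowed set.
    for tool in sorted({s.tool for s in traj.steps} - set(allowed_tools)):
        findings.append(f"disallowed tool: {tool}")
    # Runaway loop: the same call repeated more than max_repeats times.
    counts = Counter((s.tool, s.args) for s in traj.steps)
    for (tool, args), n in sorted(counts.items()):
        if n > max_repeats:
            findings.append(f"possible loop: {tool}({args}) x{n}")
    # Artifact gap: a required artifact was never produced.
    for name in sorted(set(required_artifacts) - traj.artifacts):
        findings.append(f"missing artifact: {name}")
    return findings


# A made-up run whose final string is correct but whose trajectory is not.
run = Trajectory(
    steps=[Step("search", "q=refund policy")] * 5 + [Step("shell", "rm -rf tmp")],
    final_output="Refund approved",
    artifacts={"transcript.json"},
)
findings = check_trajectory(
    run,
    expected_output="Refund approved",
    allowed_tools={"search", "read_file"},
    required_artifacts={"transcript.json", "report.md"},
)
print(findings)
```

In this sample the final output matches, so a string-only eval would pass the run, while the trajectory checks report a disallowed tool, a repeated call, and a missing artifact.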
Workflow
From one eval to a reusable gate
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
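The last step, gating the release, can be sketched as a comparison between two scorecards. This is an illustrative Python sketch under stated assumptions: the `Scorecard` fields, the `gate` function, its thresholds, and the sample numbers are all made up for the example and do not describe AgentClash's actual scorecard format or gate configuration.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    """Hypothetical scorecard fields; the real AgentClash schema may differ."""
    correctness: float  # fraction of checks passed, 0.0 to 1.0
    latency_s: float    # wall-clock seconds for the run
    cost_usd: float     # total spend for the run


def gate(baseline, candidate, max_latency_s, max_cost_increase=0.10):
    """Return the reasons the candidate fails the gate (empty list means pass)."""
    reasons = []
    if candidate.correctness < baseline.correctness:
        reasons.append("correctness regressed against baseline")
    if candidate.latency_s > max_latency_s:
        reasons.append("latency over budget")
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        reasons.append("cost increase over allowed margin")
    return reasons


# Made-up numbers for illustration only.
baseline = Scorecard(correctness=0.82, latency_s=41.0, cost_usd=0.37)
good_candidate = Scorecard(correctness=0.88, latency_s=44.5, cost_usd=0.39)
bad_candidate = Scorecard(correctness=0.79, latency_s=72.0, cost_usd=0.38)

good_reasons = gate(baseline, good_candidate, max_latency_s=60.0)
bad_reasons = gate(baseline, bad_candidate, max_latency_s=60.0)

# In a CI job, pass this value to sys.exit() so a regression fails the build.
exit_code = 1 if bad_reasons else 0
print(good_reasons, bad_reasons, exit_code)
```

Returning the list of reasons rather than a bare boolean keeps the verdict reviewable: the CI log can say why a candidate was blocked.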
Run your first agent eval
Bring your first workload into the loop
Use challenge packs for repeatable workloads, then promote failures into regression cases your team can run in CI.
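One way to picture promoting a failure into a regression case is to freeze the inputs of the failed run into a small record that CI can replay. The Python sketch below assumes a hypothetical `RegressionCase` shape, a hypothetical `promote_failure` helper, and a made-up failed run; none of these names come from AgentClash's challenge pack format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RegressionCase:
    """Hypothetical regression-case record, not AgentClash's pack format."""
    case_id: str
    task_input: str
    tools: list[str]
    required_artifacts: list[str]
    failure_note: str


def promote_failure(failed_run, note):
    """Freeze the inputs of a failed run so the same scenario can be re-run later."""
    return RegressionCase(
        case_id=f"regression-{failed_run['run_id']}",
        task_input=failed_run["task_input"],
        tools=sorted(failed_run["tools"]),
        required_artifacts=sorted(failed_run["required_artifacts"]),
        failure_note=note,
    )


# A made-up failed run; every value here is illustrative.
failed_run = {
    "run_id": "example-001",
    "task_input": "Reconcile the March invoices and write report.md",
    "tools": ["search", "read_file"],
    "required_artifacts": ["report.md"],
}
case = promote_failure(failed_run, "Agent exited without writing report.md")
case_json = json.dumps(asdict(case), indent=2)
print(case_json)
```

Keeping the failure note next to the frozen inputs lets a reviewer see what originally went wrong when the case later fails in CI.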
FAQ
Agent eval FAQ
What is an agent eval?
An agent eval is a repeatable test that runs an agent on a task, scores the full trajectory, and compares the result against a baseline or a competing agent.
How is AgentClash different from prompt evals?
Prompt evals score a single model response. AgentClash evaluates multi-turn agents that use tools in a sandbox, and scores correctness, cost, latency, and evidence quality across the whole run.
Can agent evals run in CI?
Yes. AgentClash can compare candidate and baseline scorecards and fail a pull request when the candidate regresses against the configured release gate.