AgentClash

Agent evals

Agent evals that cover the whole trajectory

An agent eval should answer one question: can the agent finish the job again with the same tools, budget, and constraints? AgentClash records replay evidence and scorecards so eval results can back release decisions.

live race
gate: pass

Candidate: 92 (correct patch, low cost)

Baseline: 88 (stable reference run)

Control: 73 (missed edge case)

replay timeline

1. loaded task inputs and tool policy
2. ran sandbox actions and captured artifacts
3. scored trajectory and validator evidence
4. attached scorecard and release verdict

ci verdict

Candidate clears release gate

Correctness improved, latency within budget, and required artifacts were preserved for review.

agentclash run create --follow

What good agent evals capture

Built for reviewable agent decisions

If your agent eval only checks the final string, you miss the tool misuse, runaway loops, and artifact gaps that show up in production.

Sandboxed real-tool execution

Head-to-head runs with fair constraints

Scorecards for correctness, cost, latency, and tool strategy

Replay trails for every important action

Challenge packs that turn failures into reusable tests

CI gates for baseline versus candidate decisions
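
To make the point about final-string checks concrete, here is a minimal Python sketch of trajectory-level checks for tool misuse, runaway loops, and artifact gaps. It is an illustration under stated assumptions, not AgentClash's implementation: the `ToolCall` record, the `trajectory_findings` function, and the `max_repeats` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded step in an agent trajectory (hypothetical schema)."""
    tool: str
    args: str


def trajectory_findings(
    calls: list[ToolCall],
    allowed_tools: set[str],
    produced_artifacts: set[str],
    required_artifacts: set[str],
    max_repeats: int = 3,
) -> list[str]:
    """Return findings that a check on the final output alone would miss."""
    findings: list[str] = []

    # Tool misuse: any call to a tool outside the task's policy.
    for call in calls:
        if call.tool not in allowed_tools:
            findings.append(f"tool misuse: {call.tool}")

    # Runaway loop: the identical call repeated back-to-back more than
    # max_repeats times. Reported once per streak.
    streak = 1
    for prev, cur in zip(calls, calls[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak == max_repeats + 1:
            findings.append(f"runaway loop: {cur.tool} repeated")

    # Artifact gap: required outputs that were never produced.
    for name in sorted(required_artifacts - produced_artifacts):
        findings.append(f"artifact gap: {name}")

    return findings


# Illustrative trajectory: four identical reads, then a disallowed tool.
calls = [ToolCall("read_file", "a.py")] * 4 + [ToolCall("curl", "example.com")]
findings = trajectory_findings(
    calls,
    allowed_tools={"read_file", "write_file"},
    produced_artifacts={"patch.diff"},
    required_artifacts={"patch.diff", "test.log"},
)
for finding in findings:
    print(finding)
```

A run like this could end with a correct-looking final answer and still surface three findings, which is the gap a trajectory-level eval is meant to close.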

Workflow

From one eval to a reusable gate

Package the task

Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.

Race the agents

Run every candidate against the same task with the same constraints.

Replay the evidence

Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.

Gate the release

Compare candidate and baseline runs, then fail CI before a regression reaches users.
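
The last step can be sketched as a small comparison between two scorecards. This is a hedged illustration only: the `Scorecard` fields, the `release_gate` function, and the cost and latency budgets are assumptions made for this example, not the AgentClash scorecard format or gate configuration. The correctness scores reuse the 92 and 88 shown in the example race above; the cost and latency figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-run summary used only for this sketch."""
    correctness: float  # 0-100, higher is better
    cost_usd: float
    latency_s: float


def release_gate(
    candidate: Scorecard,
    baseline: Scorecard,
    max_cost_usd: float,
    max_latency_s: float,
) -> tuple[bool, list[str]]:
    """Pass only if correctness did not regress and budgets are respected."""
    reasons: list[str] = []
    if candidate.correctness < baseline.correctness:
        reasons.append(
            f"correctness regressed: {candidate.correctness} < {baseline.correctness}"
        )
    if candidate.cost_usd > max_cost_usd:
        reasons.append(f"cost over budget: {candidate.cost_usd} > {max_cost_usd}")
    if candidate.latency_s > max_latency_s:
        reasons.append(f"latency over budget: {candidate.latency_s} > {max_latency_s}")
    return (not reasons, reasons)


# Illustrative values; only the correctness scores come from the example race.
candidate = Scorecard(correctness=92, cost_usd=0.40, latency_s=35.0)
baseline = Scorecard(correctness=88, cost_usd=0.55, latency_s=41.0)
passed, reasons = release_gate(candidate, baseline, max_cost_usd=0.50, max_latency_s=60.0)
print("gate:", "pass" if passed else "fail", reasons)
```

In a CI job, a script along these lines would typically exit with a nonzero status when `passed` is false, so the pull request check fails before the regression ships.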

Run your first agent eval

Bring your first workload into the loop

Use challenge packs for repeatable workloads, then promote failures into regression cases your team can run in CI.
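
One way to picture promoting a failure into a regression case is as an append to the pack's case list. The sketch below assumes a pack holds tools, scoring rules, and cases with inputs and expected artifacts, following the description in the workflow above; the `ChallengePack` and `Case` classes and the `promote_failure` helper are hypothetical and do not reflect AgentClash's actual pack format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Case:
    """One repeatable task inside a pack (hypothetical schema)."""
    name: str
    inputs: dict[str, str]
    expected_artifacts: tuple[str, ...]


@dataclass(frozen=True)
class ChallengePack:
    """Hypothetical pack: shared tools and scoring rules plus a list of cases."""
    name: str
    tools: tuple[str, ...]
    scoring_rules: tuple[str, ...]
    cases: tuple[Case, ...] = ()


def promote_failure(
    pack: ChallengePack,
    failed_task: str,
    inputs: dict[str, str],
    expected_artifacts: tuple[str, ...],
) -> ChallengePack:
    """Return a new pack with the failed task added as a regression case."""
    case = Case(f"regression-{failed_task}", inputs, expected_artifacts)
    # Idempotent: promoting the same failure twice does not duplicate the case.
    if any(existing.name == case.name for existing in pack.cases):
        return pack
    return replace(pack, cases=pack.cases + (case,))


pack = ChallengePack(
    name="patch-review",
    tools=("read_file", "write_file", "run_tests"),
    scoring_rules=("tests_pass", "patch_applies"),
)
pack = promote_failure(
    pack,
    failed_task="empty-input-edge-case",
    inputs={"repo": "fixtures/empty-input"},
    expected_artifacts=("patch.diff", "test.log"),
)
print([case.name for case in pack.cases])
```

Keeping the pack immutable and returning a new value makes each promotion easy to review as a diff, which suits cases that will later gate CI.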

FAQ

Agent eval FAQ

What is an agent eval?

An agent eval is a repeatable test that runs an agent on a task, scores the full trajectory, and compares the result to a baseline or competitor.

How is AgentClash different from prompt evals?

Prompt evals score one model response. AgentClash evaluates multi-turn agents that use tools in a sandbox, and it scores correctness, cost, latency, and evidence quality across the run.

Can agent evals run in CI?

Yes. AgentClash can compare candidate and baseline scorecards and fail a pull request when the configured release gate regresses.