AgentClash

Use case

Evaluate customer support agents on real resolutions

Support agents must resolve tickets safely, use the right tools, and leave an audit trail. AgentClash evaluates full support trajectories and keeps replay evidence, so when tone, policy adherence, or resolution quality regresses, reviewers can see where it happened.

live race
gate: pass

Candidate: 92, correct patch, low cost

Baseline: 88, stable reference run

Control: 73, missed edge case

replay timeline

1. loaded task inputs and tool policy
2. ran sandbox actions and captured artifacts
3. scored trajectory and validator evidence
4. attached scorecard and release verdict

ci verdict

Candidate clears release gate

Correctness improved, latency within budget, and required artifacts were preserved for review.

agentclash run create --follow

Support eval signals

Built for reviewable agent decisions

Score resolution correctness, policy adherence, tool usage, escalation behavior, and customer-facing artifact quality.

Sandboxed real-tool execution

Head-to-head runs with fair constraints

Scorecards for correctness, cost, latency, and tool strategy

Replay trails for every important action

Challenge packs that turn failures into reusable tests

CI gates for baseline versus candidate decisions

Workflow

Support eval workflow

Package the task

Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
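
As a rough illustration of what "inputs, tools, scoring rules, and artifacts" could look like as data, here is a minimal Python sketch. The `ChallengePack` class, its field names, and the `validate` helper are all hypothetical and are not the AgentClash pack format; the refund ticket is made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class ChallengePack:
    """Hypothetical stand-in for a challenge pack (not the AgentClash schema)."""
    name: str
    inputs: dict              # ticket text and customer context
    allowed_tools: list       # tools the agent may call
    scoring_rules: dict       # scoring dimension -> weight
    required_artifacts: list  # outputs that must exist after the run

    def validate(self) -> list:
        """Return a list of problems; an empty list means the pack is usable."""
        problems = []
        if not self.inputs:
            problems.append("pack has no inputs")
        if not self.allowed_tools:
            problems.append("pack allows no tools")
        total = sum(self.scoring_rules.values())
        if abs(total - 1.0) > 1e-9:
            problems.append(f"scoring weights sum to {total}, expected 1.0")
        return problems


# Made-up example: a double-charge refund ticket.
refund_pack = ChallengePack(
    name="double-charge-refund",
    inputs={"ticket": "I was charged twice for my order."},
    allowed_tools=["lookup_order", "issue_refund", "escalate"],
    scoring_rules={"resolution": 0.5, "policy": 0.3, "artifact_quality": 0.2},
    required_artifacts=["customer_reply", "refund_record"],
)

problems = refund_pack.validate()
```

Checking the pack before any agent runs keeps a malformed task from producing misleading scores.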

Race the agents

Run every candidate against the same task with the same constraints.

Replay the evidence

Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.

Gate the release

Compare candidate and baseline runs, then fail CI before a regression reaches users.
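
The comparison step can be sketched as a small gate function. This is an assumption-laden illustration, not the AgentClash CLI or API: the `gate` and `main` functions, the `max_drop` tolerance, and the score values are hypothetical, and every dimension is assumed to be higher-is-better.

```python
def gate(candidate: dict, baseline: dict, max_drop: float = 0.0) -> list:
    """Return the dimensions where the candidate fell more than max_drop below baseline."""
    regressions = []
    for dimension, base_score in baseline.items():
        # A dimension missing from the candidate scorecard counts as a score of 0.
        cand_score = candidate.get(dimension, 0.0)
        if base_score - cand_score > max_drop:
            regressions.append(dimension)
    return regressions


def main(candidate: dict, baseline: dict) -> int:
    """Print a verdict and return a process exit code (0 = pass, 1 = fail)."""
    failed = gate(candidate, baseline, max_drop=2.0)
    if failed:
        print("gate: fail (" + ", ".join(failed) + ")")
        return 1
    print("gate: pass")
    return 0


# Illustrative scorecards only; these numbers are not real results.
baseline = {"resolution": 88.0, "policy_adherence": 90.0}
candidate = {"resolution": 92.0, "policy_adherence": 89.0}

# In a CI script, this value would be passed to sys.exit() to fail the job.
exit_code = main(candidate, baseline)
```

Lower-is-better signals such as cost and latency would need the comparison inverted or a separate budget check.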

Start with escaped tickets

Bring your first workload into the loop

Promote real support failures into challenge packs and regression suites so the same mistake is caught before release if it returns after a model update.

FAQ

Support agent evaluation FAQ

Can AgentClash evaluate multi-turn support flows?

Yes. Multi-turn challenge packs support scripted, simulated, and human phases for realistic support conversations.

How do teams measure policy adherence?

Challenge packs encode required actions, forbidden tool use, and validator checks, so scorecards reflect whether the agent followed policy, not just whether its language was friendly.
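
To make the idea of required actions and forbidden tool use concrete, here is a minimal Python sketch of a policy check over a recorded trajectory. The `check_policy` function, the trajectory shape, and the tool names are hypothetical illustrations and not AgentClash's validator interface.

```python
def check_policy(tool_calls: list, required: list, forbidden: list) -> dict:
    """Compare the tools an agent called against required and forbidden lists."""
    called = [call["tool"] for call in tool_calls]
    # Required actions that never appear in the trajectory.
    missing = [tool for tool in required if tool not in called]
    # Calls to tools the policy forbids, in the order they occurred.
    violations = [tool for tool in called if tool in forbidden]
    return {
        "passed": not missing and not violations,
        "missing": missing,
        "violations": violations,
    }


# Made-up trajectory: the agent refunds without verifying identity first.
trajectory = [
    {"tool": "lookup_order", "args": {"order_id": "A-100"}},
    {"tool": "issue_refund", "args": {"order_id": "A-100"}},
]

result = check_policy(
    trajectory,
    required=["lookup_order", "verify_identity"],
    forbidden=["delete_account"],
)
```

In this example the reply could read as perfectly polite and the check would still fail, because `verify_identity` was never called.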

Can support evals gate releases?

Yes. Compare candidate and baseline scorecards in CI before deploying a new support agent or model route.