Use case
Evaluate customer support agents on real resolutions
Support agents must resolve tickets safely, use the right tools, and leave an audit trail. AgentClash evaluates full support trajectories and keeps replay evidence, so a regression in tone, policy, or resolution quality can be traced to the step that caused it.
[Illustration: candidate, baseline, and control runs, with a replay timeline and a CI verdict]
Correctness improved, latency within budget, and required artifacts were preserved for review.
agentclash run create --follow
Support eval signals
Built for reviewable agent decisions
Score resolution correctness, policy adherence, tool usage, escalation behavior, and customer-facing artifact quality.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
Workflow
Support eval workflow
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
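The gate step can be illustrated with a minimal sketch. This is not AgentClash's API: the `Scorecard` fields, the `gate` function, the thresholds, and the sample numbers are all hypothetical, chosen only to show how a candidate and a baseline might be compared before CI passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical summary of one run; field names are illustrative only."""
    correctness: float        # fraction of tasks resolved correctly, 0.0 to 1.0
    latency_seconds: float    # mean wall-clock time per task
    cost_usd: float           # mean spend per task
    artifacts_present: bool   # whether all required artifacts were produced


def gate(candidate, baseline, max_correctness_drop=0.0, latency_budget_seconds=30.0):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if candidate.correctness < baseline.correctness - max_correctness_drop:
        failures.append("correctness regressed")
    if candidate.latency_seconds > latency_budget_seconds:
        failures.append("latency over budget")
    if not candidate.artifacts_present:
        failures.append("required artifacts missing")
    return failures


if __name__ == "__main__":
    # Made-up numbers for illustration, not measured results.
    baseline = Scorecard(0.82, 12.0, 0.04, True)
    candidate = Scorecard(0.88, 14.5, 0.05, True)
    failures = gate(candidate, baseline)
    if failures:
        # A non-zero exit status is what makes a CI job fail.
        raise SystemExit("gate failed: " + ", ".join(failures))
    print("gate passed")
```

Returning a list of reasons rather than a single boolean keeps the verdict reviewable: the CI log states which check blocked the release.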
Start with escaped tickets
Bring your first workload into the loop
Promote real support failures into challenge packs and regression suites so the same mistake is caught if it reappears after a model update.
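As a rough sketch of what promoting an escaped ticket could look like, the example below turns one failed ticket record into a reusable test case. The `ChallengePack` class, its fields, the ticket keys, and the `must:` rule format are assumptions made for illustration; they do not describe AgentClash's actual challenge pack schema.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengePack:
    """Hypothetical container for one evaluation task."""
    name: str
    inputs: dict
    allowed_tools: list
    scoring_rules: list
    required_artifacts: list = field(default_factory=list)


def pack_from_escaped_ticket(ticket):
    """Turn one failed ticket record into a reusable regression case."""
    return ChallengePack(
        name=f"regression-{ticket['id']}",
        inputs={"customer_message": ticket["customer_message"]},
        # Copy the list so later edits to the ticket do not change the pack.
        allowed_tools=list(ticket["tools_available"]),
        # Illustrative rule format: the resolution the agent should have reached.
        scoring_rules=[f"must: {ticket['expected_resolution']}"],
        required_artifacts=["customer_reply", "internal_note"],
    )


if __name__ == "__main__":
    # Invented ticket, for illustration only.
    ticket = {
        "id": "T-1042",
        "customer_message": "I was charged twice for one order.",
        "tools_available": ["lookup_order", "issue_refund"],
        "expected_resolution": "refund the duplicate charge",
    }
    print(pack_from_escaped_ticket(ticket))
```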
FAQ
Support agent evaluation FAQ
Can AgentClash evaluate multi-turn support flows?
Yes. Multi-turn challenge packs support scripted, simulated, and human phases for realistic support conversations.
How do teams measure policy adherence?
Challenge packs encode required actions, forbidden tool use, and validator checks, so scorecards reflect whether the agent followed policy, not just whether it used friendly language.
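A minimal sketch of such a check, under stated assumptions: the `check_policy` function, the shape of a tool call record, and the tool names are hypothetical and stand in for whatever validators a real challenge pack defines.

```python
def check_policy(tool_calls, required_actions, forbidden_tools):
    """Check one trajectory's tool calls against a simple policy.

    tool_calls is a list of dicts with a "tool" key (an assumed record shape).
    """
    called = [call["tool"] for call in tool_calls]
    # Required actions that never appear in the trajectory.
    missing = [action for action in required_actions if action not in called]
    # Calls to tools the policy forbids, in the order they happened.
    forbidden_used = [tool for tool in called if tool in forbidden_tools]
    return {
        "passed": not missing and not forbidden_used,
        "missing_required": missing,
        "forbidden_used": forbidden_used,
    }


if __name__ == "__main__":
    # Invented trajectory: a polite agent that refunds without verifying identity.
    trajectory = [{"tool": "lookup_order"}, {"tool": "issue_refund"}]
    result = check_policy(
        trajectory,
        required_actions=["verify_identity", "lookup_order"],
        forbidden_tools={"delete_account"},
    )
    print(result)
```

The check looks only at what the agent did, so a courteous reply cannot compensate for a skipped verification step.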
Can support evals gate releases?
Yes. Compare candidate and baseline scorecards in CI before deploying a new support agent or model route.