Feature
Agent replay that makes failures reviewable
When an agent gate fails, reviewers need the full path, not a summary. AgentClash replay timelines show tool calls, observations, artifacts, and scorecard context in one place.
Candidate
Baseline
Control
replay timeline
ci verdict
Correctness improved, latency within budget, and required artifacts were preserved for review.
agentclash run create --follow
Why replay matters
Built for reviewable agent decisions
Replay turns agent evaluation from a single score into an auditable story of what the agent tried and what the environment returned.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
Workflow
Replay workflow
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
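The "Gate the release" step boils down to comparing two scorecards and turning any regression into a failing check. Below is a minimal sketch of that comparison. It assumes a hypothetical `Scorecard` record (correctness, cost, latency, produced artifacts) and a hypothetical `gate` function; none of these names come from AgentClash's API or file formats, and the thresholds are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-run scorecard; field names are illustrative, not AgentClash's schema."""
    correctness: float           # fraction of scoring checks passed, 0.0 to 1.0
    cost_usd: float              # total model and tool spend for the run
    latency_ms: float            # wall-clock time for the whole task
    artifacts: frozenset[str]    # names of artifacts the run produced


def gate(
    candidate: Scorecard,
    baseline: Scorecard,
    latency_budget_ms: float,
    cost_budget_usd: float,
    required_artifacts: frozenset[str],
) -> list[str]:
    """Return the reasons to fail CI; an empty list means the candidate passes."""
    reasons: list[str] = []
    # Correctness must not drop below the baseline run on the same task.
    if candidate.correctness < baseline.correctness:
        reasons.append(
            f"correctness regressed: {candidate.correctness:.2f} < {baseline.correctness:.2f}"
        )
    # Latency and cost are checked against fixed budgets rather than the baseline.
    if candidate.latency_ms > latency_budget_ms:
        reasons.append(f"latency {candidate.latency_ms:.0f} ms exceeds budget {latency_budget_ms:.0f} ms")
    if candidate.cost_usd > cost_budget_usd:
        reasons.append(f"cost ${candidate.cost_usd:.2f} exceeds budget ${cost_budget_usd:.2f}")
    # Every required artifact must still be present for reviewers.
    missing = required_artifacts - candidate.artifacts
    if missing:
        reasons.append(f"missing required artifacts: {sorted(missing)}")
    return reasons
```

In a CI job, a small wrapper would print each reason and call `sys.exit(1)` when the list is non-empty, which is what makes the pipeline fail before the regression ships.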
Debug with docs
Bring your first workload into the loop
Pair replay with the interpret-results guidance and challenge packs so that every failure you debug becomes a reusable test.
FAQ
Agent replay FAQ
What does AgentClash replay include?
Tool calls, observations, logs, artifacts, timing, and links to the scorecard dimensions that passed or failed.
Can replay be shared with reviewers?
Yes. Runs support shareable views so PMs, support leads, and engineers can review the same evidence.
How does replay help CI failures?
When a gate fails, replay shows exactly which step diverged from the baseline, so you are not left guessing from a single error string.
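Finding "which step diverged from baseline" is essentially a walk over two ordered timelines until the first mismatch. The sketch below illustrates the idea with a hypothetical `Step` record (tool name, arguments, observation) and a hypothetical `first_divergence` helper; it does not reflect AgentClash's actual replay format.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical replay step: one tool call and what the environment returned."""
    tool: str
    arguments: str
    observation: str


def first_divergence(baseline: list[Step], candidate: list[Step]) -> Optional[int]:
    """Return the index of the first step where the agent's actions differ, or None."""
    # zip_longest pads the shorter timeline with None, so an early stop or an
    # extra step in either run also counts as a divergence.
    for index, (b, c) in enumerate(zip_longest(baseline, candidate)):
        if b is None or c is None:
            return index
        # Compare actions (tool and arguments) rather than observations, since
        # observations can legitimately vary between runs (timestamps, IDs).
        if (b.tool, b.arguments) != (c.tool, c.arguments):
            return index
    return None
```

Comparing only the action keeps noisy environment output from flagging false divergences; a reviewer can still open the observation at the returned index to see what the environment returned just before the runs split.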