Compare And Triage Skill

Use when comparing baseline vs candidate AgentClash runs, evaluating release gates, managing workspace baseline bookmarks, or building a replay triage envelope after an eval completes.
Canonical source: web/content/agent-skills/agentclash-compare-and-triage/SKILL.md
Markdown export: /docs-md/agent-skills/agentclash-compare-and-triage
Full SKILL.md

---
name: agentclash-compare-and-triage
description: Use when comparing baseline vs candidate AgentClash runs, evaluating release gates, managing workspace baseline bookmarks, or building a replay triage envelope after an eval completes.
metadata:
  agentclash.role: comparison
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Compare And Triage

## Purpose
Manage workspace baselines, compare runs for regressions, evaluate release gates for CI, and assemble replay triage evidence (ranking, failures, scorecard, replay steps) in one workflow.

## Use When
- A user asks whether a new run regressed vs a baseline.
- CI needs `compare gate` exit codes for pass/fail verdicts.
- A user wants the fastest path: `compare latest` against the saved baseline bookmark.
- After a run completes, the user needs structured triage with suggested follow-up commands.

## Do Not Use When
- No completed runs exist yet; use `agentclash-eval-runner` first.
- The task is only deep scorecard interpretation without comparison; use `agentclash-scorecard-reader`.
- The task is authoring CI manifest files from scratch; use `agentclash-ci-release-gate` (this skill covers CLI compare/gate/triage commands).

## Inputs Needed
- Workspace with at least one completed candidate run.
- Baseline bookmark (`baseline set`) for `compare latest`, or explicit run IDs for `compare runs` / `compare gate`.
- Optional run agent ID or label when runs have multiple agents.
- For triage: run ID or selector; optional `--agent`, `--cursor`, `--limit`.

## Environment
```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash workspace use <WORKSPACE_ID>
agentclash baseline show
```

## Procedure
1. After a good eval, bookmark it: `agentclash baseline set [run] --agent <label>`.
2. Run new evals with `agentclash eval start` (see eval-runner skill).
3. Compare:
   - Ad hoc: `agentclash compare runs --baseline <ID> --candidate <ID>`
   - Fast path: `agentclash compare latest` (uses saved baseline vs latest non-baseline run)
   - CI gate: `agentclash compare gate --baseline <ID> --candidate <ID>`
   - Latest + gate: `agentclash compare latest --gate`
4. Triage evidence: `agentclash replay triage <run> [--agent <label>]`.
5. Follow `next_commands` from triage JSON for deeper replay or scorecard reads.
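
Steps 3 and 4 above can be chained in a small wrapper. The following is a minimal sketch, not part of the CLI: `compare_and_triage` is a hypothetical helper name, the candidate run ID is supplied by the caller, and it assumes `replay triage` accepts `--json` together with a run ID as listed under Commands.

```bash
# Hypothetical helper: gate the latest run against the saved baseline,
# then collect triage evidence for the given candidate run.
compare_and_triage() {
  local candidate_run="$1"
  local gate_status=0

  # Step 3: compare the saved baseline with the latest run and apply the gate.
  # A non-pass verdict exits nonzero; remember it instead of stopping here.
  agentclash compare latest --gate || gate_status=$?

  # Step 4: gather triage evidence whether or not the gate passed.
  agentclash replay triage "$candidate_run" --json

  # Report the gate result to the caller (0 means pass).
  return "$gate_status"
}
```

Calling `compare_and_triage <RUN_ID>` from a CI step would keep the gate's exit status as the step result while still printing triage evidence when the gate does not pass.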

## Commands

### Baseline bookmark (workspace-scoped)
```bash
agentclash baseline set [run]
agentclash baseline set [run] --agent <RUN_AGENT_ID_OR_LABEL>
agentclash baseline show
agentclash baseline clear
```

- `baseline set` with no run opens an interactive picker in a TTY.
- The bookmark stores the run ID, run agent ID, names, and timestamp in the CLI config.
- `compare latest` reads this bookmark as the baseline side.

### Compare runs
```bash
agentclash compare runs \
  --baseline <BASELINE_RUN_ID> \
  --candidate <CANDIDATE_RUN_ID> \
  --baseline-agent <RUN_AGENT_ID_OR_LABEL> \
  --candidate-agent <RUN_AGENT_ID_OR_LABEL>

agentclash run compare \
  --baseline <BASELINE_RUN_ID> \
  --candidate <CANDIDATE_RUN_ID>
```

Shared comparison flags (both `compare runs` and `run compare`):

- `--baseline` (required)
- `--candidate` (required)
- `--baseline-agent`: optional; defaults to first agent or saved baseline agent when applicable
- `--candidate-agent`: optional
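
Because the two agent flags are optional, a script may want to add them only when a value is available. The sketch below shows one way to do that; `compare_runs` is a hypothetical wrapper name, and an empty string is used here to mean "flag not supplied".

```bash
# Hypothetical wrapper: build the `compare runs` command line, adding the
# optional agent flags only when the caller passes a non-empty value.
compare_runs() {
  local baseline="$1" candidate="$2"
  local baseline_agent="${3:-}" candidate_agent="${4:-}"

  # Reuse the positional parameters as the argument list under construction.
  set -- compare runs --baseline "$baseline" --candidate "$candidate"
  if [ -n "$baseline_agent" ]; then
    set -- "$@" --baseline-agent "$baseline_agent"
  fi
  if [ -n "$candidate_agent" ]; then
    set -- "$@" --candidate-agent "$candidate_agent"
  fi

  agentclash "$@"
}
```

For example, `compare_runs <BASELINE_RUN_ID> <CANDIDATE_RUN_ID> "" <LABEL>` would pass only `--candidate-agent` and leave the baseline agent to the CLI default.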

### Compare latest (baseline bookmark vs newest run)
```bash
agentclash compare latest
agentclash compare latest --gate
agentclash compare latest --agent <RUN_AGENT_ID_OR_LABEL>
agentclash compare latest --baseline-agent <ID_OR_LABEL> --candidate-agent <ID_OR_LABEL>
agentclash compare latest --json
```

- Requires a saved baseline bookmark unless the baseline run can be inferred from flags.
- `--gate` evaluates release gate rules and exits nonzero for non-pass verdicts (same as `compare gate`).
- Structured output includes the comparison envelope and an optional `release_gate` object.

### Compare gate (explicit IDs, CI-friendly exit code)
```bash
agentclash compare gate \
  --baseline <BASELINE_RUN_ID> \
  --candidate <CANDIDATE_RUN_ID> \
  --baseline-agent <RUN_AGENT_ID_OR_LABEL> \
  --candidate-agent <RUN_AGENT_ID_OR_LABEL>
```

- `--baseline` and `--candidate` are required.
- Non-pass gate verdicts exit nonzero for shell/CI scripts.
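
In a CI script, the exit status can be consumed roughly as follows. This is a sketch under stated assumptions: `gate_or_block` is a hypothetical helper, and since the specific nonzero codes are not documented here, it treats every nonzero status (non-pass verdict or CLI error alike) as a reason to block.

```bash
# Hypothetical CI helper: run the gate and propagate its exit status.
gate_or_block() {
  local baseline="$1" candidate="$2" code=0

  # Capture the status instead of letting `set -e` abort the script.
  agentclash compare gate --baseline "$baseline" --candidate "$candidate" || code=$?

  if [ "$code" -eq 0 ]; then
    echo "gate: pass"
  else
    # Assumption: any nonzero status is a non-pass verdict or a CLI error.
    echo "gate: non-pass or CLI error (exit $code); blocking release" >&2
  fi
  return "$code"
}
```

Returning the original status, rather than a fixed `1`, leaves room for a caller to distinguish codes later if the CLI documents them.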

### Replay triage envelope
```bash
agentclash replay triage <RUN_ID_OR_SELECTOR>
agentclash replay triage <RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL>
agentclash replay triage <RUN_ID> --cursor 0 --limit 5
agentclash replay triage <RUN_ID> --json
```

Flags:

- `--agent`: run agent ID or label; required in non-interactive mode when multiple agents exist.
- `--cursor`: replay step offset (default 0).
- `--limit`: steps to include, 1–50 (default 5).

Triage envelope includes:

- `run`, `agents`, `selected_agent`, `ranking`, `failures`, `artifacts`
- `scorecard` and `replay` when an agent is selected
- `next_commands`: suggested follow-ups (e.g. deeper replay, scorecard, compare)
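
The `--cursor` and `--limit` flags lend themselves to paging through replay steps. A minimal sketch follows, assuming the cursor is a zero-based step offset so that page `k` starts at `k * limit`; `triage_page` is a hypothetical helper, and the range check mirrors the documented 1–50 bound.

```bash
# Hypothetical helper: fetch one "page" of replay steps from a triage envelope.
triage_page() {
  local run="$1" page="${2:-0}" limit="${3:-5}"

  # --limit is documented as 1-50; reject other values before calling the CLI.
  if [ "$limit" -lt 1 ] || [ "$limit" -gt 50 ]; then
    echo "limit must be between 1 and 50" >&2
    return 2
  fi

  # Assumption: the cursor counts steps from 0, so page k begins at k * limit.
  local cursor=$(( page * limit ))
  agentclash replay triage "$run" --cursor "$cursor" --limit "$limit"
}
```

With these assumptions, `triage_page <RUN_ID> 2 10` would request steps starting at offset 20, ten at a time.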

## Expected Output
- **Compare**: human-readable tables or JSON with candidate/baseline metrics, deltas, and optional `release_gate.verdict`.
- **compare latest --gate**: prints the comparison, then exits 1 on gate failure.
- **replay triage**: consolidated evidence bundle; use `--json` for automation.

## Failure Modes
- No baseline bookmark for `compare latest` → run `agentclash baseline set` on a known-good run.
- No candidate run newer than the baseline → create a new eval first.
- Multiple run agents without `--agent` on triage → pass `--agent` or use an interactive TTY.
- Gate verdict pending because the scorecard is not ready → wait for run completion; check `agentclash run get <id>`.
- Invalid agent selector → list agents with `agentclash run agents <run_id> --json`.
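
The multiple-agents failure mode can be avoided up front in scripts. The sketch below is deliberately conservative and rests on two assumptions: that "interactive" means standard input is a terminal, and that it is acceptable to require an agent in every non-interactive call, even for single-agent runs where the CLI itself would not need one. `triage_safe` is a hypothetical helper name.

```bash
# Hypothetical guard: only omit --agent when a terminal is available for prompting.
triage_safe() {
  local run="$1" agent="${2:-}"

  if [ -n "$agent" ]; then
    agentclash replay triage "$run" --agent "$agent"
  elif [ -t 0 ]; then
    # Assumption: with stdin attached to a TTY, the CLI can ask which agent to use.
    agentclash replay triage "$run"
  else
    echo "non-interactive shell: pass a run agent ID or label" >&2
    echo "list agents with: agentclash run agents $run --json" >&2
    return 2
  fi
}
```

In a CI job, `triage_safe <RUN_ID>` without an agent would stop early with a hint instead of reaching the CLI's own error path.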

## Safety Notes
- Comparisons are read-only but may surface sensitive failure excerpts; do not paste them into public channels.
- Gate failures should block release; confirm with the user before overriding CI exit codes.

## Report Back Format
```text
Baseline: <run_id> / agent <id or label>
Candidate: <run_id> / agent <id or label>
Compare command: <command used>
Gate verdict: <pass|fail|pending|n/a>
Key deltas: <summary>
Triage agent: <selected agent>
Failures: <count / top class>
Next commands:
- <from triage envelope>
Recommendation: <ship|investigate|rerun>
```

## Related Skills
- `agentclash-hub`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
- `agentclash-ci-release-gate`
- `agentclash-regression-flywheel`

## Related Docs
- `/docs-md/guides/interpret-results`
- `/docs-md/guides/ci-cd-agent-gates`
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/reference/cli`