Regression Flywheel Skill

Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.

Canonical source: web/content/agent-skills/agentclash-regression-flywheel/SKILL.md

Markdown export: /docs-md/agent-skills/agentclash-regression-flywheel

Full SKILL.md

---
name: agentclash-regression-flywheel
description: Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.
metadata:
  agentclash.role: regression
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Regression Flywheel

## Purpose
Turn understood AgentClash failures into durable regression coverage, then verify the promoted cases with suite-only runs.

## Use When
- A user wants to inspect failure-review items and decide which failures should become regression cases.
- A failure item has `promotable: true` and a useful `promotion_mode_available`.
- A regression suite needs to be created, renamed, archived, reactivated, or used for verification.
- A regression case needs title, description, status, or severity cleanup after promotion.
- A fix needs to be checked against a targeted regression suite or case.

## Do Not Use When
- The run has not produced failure evidence yet; use `agentclash-eval-runner` to run or follow it.
- The user only needs to interpret a scorecard, replay, artifact, or ranking; use `agentclash-scorecard-reader` first.
- The challenge pack itself needs authoring, validation, or publishing; use the challenge-pack skills.
- The task is to configure release gates or CI promotion policy; use `agentclash-ci-release-gate`.

## Inputs Needed
- Workspace ID or configured workspace context.
- Run ID containing failure-review items.
- Source challenge pack ID for the target regression suite.
- Target suite ID, or the suite name/details needed to create one.
- Failure selector: `challenge_identity_id` from `run failures --json`, plus `run_agent_id` when more than one agent failed the same challenge.
- Promotion mode from the failure item's `promotion_mode_available`: `full_executable` or `output_only`.
- Case title, optional failure summary, optional severity, and any validator overrides.
- Deployment and challenge pack version IDs/selectors for a suite-only verification run.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
```

All commands in this skill require workspace context. Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`.

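To make the resolution rule concrete, here is a minimal Python sketch. It assumes the four sources are checked in the order they are listed, which this skill does not state explicitly; confirm the real precedence in the CLI setup rules. The function name `resolve_workspace` and its parameters are hypothetical and are not part of the CLI.

```python
from typing import Mapping, Optional


def resolve_workspace(
    flag_value: Optional[str],
    env: Mapping[str, str],
    saved_config_value: Optional[str],
    project_file_value: Optional[str],
) -> Optional[str]:
    """Return the first non-empty workspace ID.

    Assumption: sources are consulted in the order the skill lists them.
    """
    candidates = [
        flag_value,                       # --workspace
        env.get("AGENTCLASH_WORKSPACE"),  # environment variable
        saved_config_value,               # saved CLI config
        project_file_value,               # .agentclash.yaml
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None  # no workspace context; commands in this skill would fail
```

A `None` result corresponds to the "Missing workspace" case: supply one of the four sources before running any command.
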
## Procedure
1. Read failure-review items for the run and group them by `failure_cluster_key`, `severity`, `failure_class`, and `promotable`.
2. Use `agentclash-scorecard-reader` evidence first: confirm the failed dimensions, judge/validator refs, replay refs, and artifact refs before promotion.
3. Choose an existing active suite whose `source_challenge_pack_id` matches the run source pack, or create one for that source pack.
4. Check for duplicates in the target suite by source failure cluster, failure fingerprint, challenge key, case key, and existing active/proposed cases.
5. Promote the failure with `run promote-failure <RUN_ID> <CHALLENGE_IDENTITY_ID>`.
6. Review the generated case JSON and update title, description, status, or severity if needed.
7. Run a suite-only verification against the updated deployment and report pass/fail coverage.

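Step 1 can be sketched in Python once the `items` array from `run failures --json` has been parsed. This is an illustrative sketch, not part of the CLI: `group_failures` is a hypothetical helper, and the sample values are made up.

```python
from collections import defaultdict


def group_failures(items):
    """Group failure-review items by (cluster, severity, class, promotable)."""
    groups = defaultdict(list)
    for item in items:
        key = (
            item.get("failure_cluster_key"),
            item.get("severity"),
            item.get("failure_class"),
            bool(item.get("promotable")),
        )
        groups[key].append(item)
    return dict(groups)


# Illustrative items only; these are not real CLI output.
sample_items = [
    {"challenge_identity_id": "ci_1", "failure_cluster_key": "frc_a",
     "severity": "blocking", "failure_class": "policy_violation", "promotable": True},
    {"challenge_identity_id": "ci_2", "failure_cluster_key": "frc_a",
     "severity": "blocking", "failure_class": "policy_violation", "promotable": True},
    {"challenge_identity_id": "ci_3", "failure_cluster_key": "frc_b",
     "severity": "warning", "failure_class": "malformed_output", "promotable": False},
]

for key, members in group_failures(sample_items).items():
    print(key, [m["challenge_identity_id"] for m in members])
```

Grouping first makes it easier to promote one representative per cluster instead of one case per failing item.
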
## Inspect Failures
Start with failure-review items:

```bash
agentclash run failures <RUN_ID> --json
agentclash run failures <RUN_ID> --agent <RUN_AGENT_ID> --json
agentclash run failures <RUN_ID> --severity blocking --json
agentclash run failures <RUN_ID> --class policy_violation --json
agentclash run failures <RUN_ID> --cluster <FAILURE_CLUSTER_KEY> --limit 50 --json
```

Supported filters are:

- `--agent <RUN_AGENT_ID>`
- `--severity info|warning|blocking`
- `--class <failure_class>`
- `--evidence-tier none|native_structured|hosted_structured|hosted_black_box|derived_summary`
- `--cluster <FAILURE_CLUSTER_KEY>`
- `--cursor <NEXT_CURSOR>`
- `--limit <COUNT>`

Failure classes currently accepted by the API are `incorrect_final_output`, `tool_selection_error`, `tool_argument_error`, `retrieval_grounding_failure`, `policy_violation`, `timeout_or_budget_exhaustion`, `sandbox_failure`, `dependency_resolution_failure`, `malformed_output`, `flaky_non_deterministic`, `insufficient_evidence`, and `other`.

The fields that matter for promotion are:

```json
{
  "items": [
    {
      "run_id": "<RUN_ID>",
      "run_agent_id": "<RUN_AGENT_ID>",
      "challenge_identity_id": "<CHALLENGE_IDENTITY_ID>",
      "challenge_key": "<challenge_key>",
      "case_key": "<case_key>",
      "item_key": "<item_key>",
      "failure_fingerprint": "frf_...",
      "failure_cluster_key": "frc_...",
      "failure_state": "failed",
      "failed_dimensions": ["correctness"],
      "failed_checks": ["<validator_or_judge_key>"],
      "failure_class": "policy_violation",
      "headline": "<headline>",
      "detail": "<detail>",
      "recommended_action": "<recommended action>",
      "promotable": true,
      "promotion_mode_available": ["full_executable", "output_only"],
      "replay_step_refs": [],
      "artifact_refs": [],
      "judge_refs": [],
      "metric_refs": [],
      "evidence_tier": "hosted_structured",
      "severity": "blocking"
    }
  ],
  "clusters": [
    {
      "failure_cluster_key": "frc_...",
      "representative_failure_fingerprint": "frf_...",
      "count": 2,
      "promotable_count": 1,
      "severity": "blocking",
      "failure_state": "failed",
      "failure_class": "policy_violation",
      "evidence_tier": "hosted_structured",
      "challenge_keys": ["<challenge_key>"],
      "case_keys": ["<case_key>"],
      "run_agent_ids": ["<RUN_AGENT_ID>"],
      "headline": "<headline>",
      "recommended_action": "<recommended action>"
    }
  ],
  "next_cursor": "<cursor>"
}
```

Promote only when `promotable` is true and the chosen `promotion_mode` appears in `promotion_mode_available`.

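The promotion precondition can be written as a small predicate over one parsed failure item. This is a minimal sketch; `can_promote` is a hypothetical helper name, and the backend remains the authority on what is promotable.

```python
def can_promote(item, mode):
    """True only if the item is promotable and offers the requested mode."""
    available = item.get("promotion_mode_available") or []
    return item.get("promotable") is True and mode in available


# Illustrative item shaped like one entry of `items`; values are made up.
item = {"promotable": True, "promotion_mode_available": ["output_only"]}
print(can_promote(item, "output_only"))
print(can_promote(item, "full_executable"))
```

Running the check before calling `run promote-failure` avoids a round trip that would end in `failure_not_promotable` or `promotion_mode_unavailable`.
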
## Manage Suites
List and inspect suites:

```bash
agentclash regression-suite list --json
agentclash regression-suite get <SUITE_ID> --json
agentclash regression-suite cases <SUITE_ID> --json
```

`regression-suite` also has the alias `regression-suites`.

Create a suite:

```bash
agentclash regression-suite create \
  --source-challenge-pack-id <CHALLENGE_PACK_ID> \
  --name "Checkout regressions" \
  --description "Failures promoted from checkout evals" \
  --default-gate-severity warning \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "source_challenge_pack_id": "<CHALLENGE_PACK_ID>",
  "name": "Checkout regressions",
  "description": "Failures promoted from checkout evals",
  "default_gate_severity": "warning"
}
```

Exact suite create rules:

- `source_challenge_pack_id` is required and must identify a challenge pack visible to the workspace.
- `name` is required.
- `default_gate_severity` is optional and defaults to `warning`.
- Allowed severities are `info`, `warning`, and `blocking`.
- New suites are created with `status: "active"` and `source_mode: "derived_only"`.

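A `--from-file` payload can be checked locally against the create rules before it is submitted. The sketch below covers only the rules that can be verified offline; it cannot tell whether the challenge pack is visible to the workspace. `check_suite_create` is a hypothetical helper, not a CLI or API function.

```python
ALLOWED_SEVERITIES = {"info", "warning", "blocking"}


def check_suite_create(payload):
    """Return (normalized_payload, errors) for a suite create payload."""
    errors = []
    if not payload.get("source_challenge_pack_id"):
        errors.append("source_challenge_pack_id is required")
    if not payload.get("name"):
        errors.append("name is required")
    normalized = dict(payload)
    # The documented default applies when the field is omitted.
    normalized.setdefault("default_gate_severity", "warning")
    if normalized["default_gate_severity"] not in ALLOWED_SEVERITIES:
        errors.append("default_gate_severity must be info, warning, or blocking")
    return normalized, errors


# Placeholder pack ID for illustration only.
normalized, errors = check_suite_create(
    {"source_challenge_pack_id": "pack_123", "name": "Checkout regressions"}
)
print(normalized["default_gate_severity"], errors)
```
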
Update a suite:

```bash
agentclash regression-suite update <SUITE_ID> \
  --name "Checkout regressions" \
  --description "Current production blockers" \
  --status active \
  --default-gate-severity blocking \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "name": "Checkout regressions",
  "description": "Current production blockers",
  "status": "active",
  "default_gate_severity": "blocking"
}
```

Exact suite update rules:

- At least one field must be provided.
- `status` must be `active` or `archived`.
- `default_gate_severity` must be `info`, `warning`, or `blocking`.
- Archived suites cannot accept new promotions.

Suite JSON includes:

```json
{
  "id": "<SUITE_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "source_challenge_pack_id": "<CHALLENGE_PACK_ID>",
  "name": "Checkout regressions",
  "description": "Current production blockers",
  "status": "active",
  "source_mode": "derived_only",
  "default_gate_severity": "blocking",
  "case_count": 3,
  "created_by_user_id": "<USER_ID>",
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

`regression-suite list --json` prints `{ "items": [...] }` from the CLI. It does not expose the API's `total`, `limit`, or `offset` fields today.

## Promote Failures
The promotion command shape is:

```bash
agentclash run promote-failure <RUN_ID> <CHALLENGE_IDENTITY_ID> \
  --run-agent <RUN_AGENT_ID> \
  --suite <SUITE_ID> \
  --promotion-mode full_executable \
  --title "Policy answer must refuse credential disclosure" \
  --failure-summary "Agent disclosed a credential-like value instead of refusing." \
  --severity blocking \
  --json
```

Important exact details:

- The second positional argument is `challenge_identity_id` from `run failures --json`, not `failure_fingerprint` or `failure_cluster_key`.
- Pass `--run-agent` when the same challenge identity failed for multiple run agents; otherwise the backend returns `failure_review_item_ambiguous`.
- `--suite`, `--promotion-mode`, and `--title` map to required JSON fields.
- `--promotion-mode` should be `full_executable` or `output_only`, and it must be present in the failure item's `promotion_mode_available`.
- `--severity` is optional. If omitted, `policy_violation` and `sandbox_failure` default to `blocking`; other failure classes default to `warning`.
- The CLI has no `--status`, `--validator-overrides`, or `--metadata` flags for promotion. Use `--from-file` for those fields.

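The default-severity rule is easy to misremember, so here it is as a minimal Python sketch. `effective_severity` is a hypothetical helper that predicts what the backend would assign when `--severity` is omitted; it only mirrors the rule stated above.

```python
BLOCKING_BY_DEFAULT = {"policy_violation", "sandbox_failure"}


def effective_severity(failure_class, severity=None):
    """Return the explicit severity, or the documented default for the class."""
    if severity is not None:
        return severity
    return "blocking" if failure_class in BLOCKING_BY_DEFAULT else "warning"


print(effective_severity("sandbox_failure"))
print(effective_severity("malformed_output"))
print(effective_severity("policy_violation", severity="info"))
```

If the predicted default is not what the reviewer wants, pass `--severity` explicitly rather than editing the case afterwards.
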
Full `--from-file` payload:

```json
{
  "run_agent_id": "<RUN_AGENT_ID>",
  "suite_id": "<SUITE_ID>",
  "promotion_mode": "full_executable",
  "title": "Policy answer must refuse credential disclosure",
  "failure_summary": "Agent disclosed a credential-like value instead of refusing.",
  "status": "proposed",
  "severity": "blocking",
  "validator_overrides": {
    "judge_threshold_overrides": {
      "policy_refusal": 0.9
    },
    "assertion_toggles": {
      "must_refuse": true
    }
  },
  "metadata": {
    "source": "triage",
    "source_challenge_key": "<challenge_key>",
    "source_failure_fingerprint": "frf_...",
    "source_failure_cluster_key": "frc_..."
  }
}
```

Exact promotion rules:

- `suite_id` is required.
- `title` is required.
- `status`, when provided, must be `active` or `proposed`.
- `severity`, when provided, must be `info`, `warning`, or `blocking`.
- `validator_overrides` may contain only `judge_threshold_overrides` and `assertion_toggles`.
- `metadata` must be a JSON object or null.
- If you want `source_challenge_key`, `source_failure_fingerprint`, or `source_failure_cluster_key` on the case response for duplicate checks, include those exact keys in `metadata`.
- The target suite must be active and must have the same `source_challenge_pack_id` as the run source pack.
- The failure item must be promotable. Items without a challenge input set or with insufficient reproduction context may have no available promotion modes.

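The payload-level rules above can be pre-checked before writing a `--from-file` document. The following is a sketch under the assumption that `promotion_mode` is required in the payload (the flag maps to a required JSON field); suite state, pack matching, and promotability can only be verified by the backend. `check_promotion_payload` is a hypothetical helper.

```python
ALLOWED_MODES = {"full_executable", "output_only"}
ALLOWED_STATUSES = {"active", "proposed"}
ALLOWED_SEVERITIES = {"info", "warning", "blocking"}
ALLOWED_OVERRIDE_KEYS = {"judge_threshold_overrides", "assertion_toggles"}


def check_promotion_payload(payload):
    """Return a list of rule violations that can be detected offline."""
    errors = []
    for field in ("suite_id", "title"):
        if not payload.get(field):
            errors.append(f"{field} is required")
    if payload.get("promotion_mode") not in ALLOWED_MODES:
        errors.append("promotion_mode must be full_executable or output_only")
    if "status" in payload and payload["status"] not in ALLOWED_STATUSES:
        errors.append("status must be active or proposed")
    if "severity" in payload and payload["severity"] not in ALLOWED_SEVERITIES:
        errors.append("severity must be info, warning, or blocking")
    overrides = payload.get("validator_overrides") or {}
    unknown = set(overrides) - ALLOWED_OVERRIDE_KEYS
    if unknown:
        errors.append("unsupported validator_overrides keys: " + ", ".join(sorted(unknown)))
    metadata = payload.get("metadata")
    if metadata is not None and not isinstance(metadata, dict):
        errors.append("metadata must be a JSON object or null")
    return errors


# Placeholder IDs for illustration only.
ok_payload = {"suite_id": "suite_1", "title": "Must refuse", "promotion_mode": "output_only",
              "status": "proposed", "metadata": None}
print(check_promotion_payload(ok_payload))
```
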
`run promote-failure --json` prints the regression case object directly. The HTTP status is 201 when a case is created and 200 when the same suite, run agent, and challenge identity already map to an existing case; the CLI JSON output is the case in both paths.

## Review and Edit Cases
List cases in a suite:

```bash
agentclash regression-suite cases <SUITE_ID> --json
```

Update a case:

```bash
agentclash regression-suite case update <CASE_ID> \
  --title "Policy answer must refuse credential disclosure" \
  --description "Covers credential disclosure requests in support chat." \
  --status active \
  --severity blocking \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "title": "Policy answer must refuse credential disclosure",
  "description": "Covers credential disclosure requests in support chat.",
  "status": "active",
  "severity": "blocking"
}
```

Exact case update rules:

- At least one field must be provided.
- `status` must be `proposed`, `active`, `muted`, `archived`, or `rejected`.
- `severity` must be `info`, `warning`, or `blocking`.
- There is no CLI command today to create a regression case directly, fetch a single case directly, or patch `expected_contract`, `payload_snapshot`, `validator_overrides`, or `metadata` after promotion.

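When case edits are generated by a script, a small builder can enforce the update rules before the payload is written to a file. This is a sketch only; `build_case_update` is a hypothetical helper that accepts just the four editable fields listed above.

```python
CASE_STATUSES = {"proposed", "active", "muted", "archived", "rejected"}
SEVERITIES = {"info", "warning", "blocking"}


def build_case_update(title=None, description=None, status=None, severity=None):
    """Build a case update payload, raising ValueError on a rule violation."""
    if status is not None and status not in CASE_STATUSES:
        raise ValueError(f"unsupported status: {status}")
    if severity is not None and severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    fields = {"title": title, "description": description,
              "status": status, "severity": severity}
    # Omit unset fields so the update touches only what was requested.
    payload = {key: value for key, value in fields.items() if value is not None}
    if not payload:
        raise ValueError("at least one field must be provided")
    return payload


print(build_case_update(status="muted"))
```
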
Case JSON includes:

```json
{
  "id": "<CASE_ID>",
  "suite_id": "<SUITE_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "title": "Policy answer must refuse credential disclosure",
  "description": "Covers credential disclosure requests in support chat.",
  "status": "active",
  "severity": "blocking",
  "promotion_mode": "full_executable",
  "source_run_id": "<RUN_ID>",
  "source_run_agent_id": "<RUN_AGENT_ID>",
  "source_replay_id": "<REPLAY_ID>",
  "source_challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "source_challenge_input_set_id": "<INPUT_SET_ID>",
  "source_challenge_identity_id": "<CHALLENGE_IDENTITY_ID>",
  "source_challenge_key": "<challenge_key>",
  "source_case_key": "<case_key>",
  "source_item_key": "<item_key>",
  "source_failure_fingerprint": "frf_...",
  "source_failure_cluster_key": "frc_...",
  "evidence_tier": "hosted_structured",
  "failure_class": "policy_violation",
  "failure_summary": "<summary>",
  "payload_snapshot": {},
  "expected_contract": {},
  "validator_overrides": {},
  "metadata": {},
  "latest_promotion": {
    "id": "<PROMOTION_ID>",
    "workspace_regression_case_id": "<CASE_ID>",
    "source_run_id": "<RUN_ID>",
    "source_run_agent_id": "<RUN_AGENT_ID>",
    "source_event_refs": [],
    "promoted_by_user_id": "<USER_ID>",
    "promotion_reason": "<summary>",
    "promotion_snapshot": {},
    "created_at": "<timestamp>"
  },
  "validation": {
    "status": "not_validated",
    "run_count": 0,
    "failure_count": 0,
    "pass_count": 0,
    "reproduction_threshold": 0.6,
    "required_runs": 5,
    "remaining_runs": 5,
    "recommended_action": "<action>"
  },
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

Validation status values are `not_validated`, `collecting_signal`, `reproducing`, `passing`, and `flaky`.

## Duplicate and Quality Checks
Before promotion:

- Compare the target suite's existing cases by `source_case_key`, `status`, and any available `source_failure_cluster_key`, `source_failure_fingerprint`, or `source_challenge_key` fields.
- Prefer updating or reusing an existing active/proposed case when a failure is the same behavior, even if it came from a different run.
- Promote only failures with concrete replay, judge, validator, metric, or artifact evidence. Avoid promoting `insufficient_evidence` unless the goal is explicitly to track missing evidence.
- Use `full_executable` when the failure has a challenge input set and enough structured evidence to replay the case. Use `output_only` when only the final output contract can be captured.

Backend duplicate protection is intentionally narrow: the same suite, run agent, and challenge identity returns the existing case. Cross-run duplicates and cross-suite duplicates are reviewer decisions.

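Because cross-run duplicates are left to the reviewer, a local comparison between one failure item and the suite's existing cases can surface candidates. The sketch below pairs each failure-item field with the `source_*` case field of the same meaning and ignores cases that are not active or proposed. `find_duplicate_candidates` is a hypothetical helper; a match is a prompt for review, not proof of duplication.

```python
# Failure-item field -> regression-case field carrying the same value.
MATCH_FIELDS = {
    "failure_cluster_key": "source_failure_cluster_key",
    "failure_fingerprint": "source_failure_fingerprint",
    "challenge_key": "source_challenge_key",
    "case_key": "source_case_key",
}


def find_duplicate_candidates(failure_item, suite_cases):
    """Return (case_id, matched_fields) for active/proposed cases that overlap."""
    candidates = []
    for case in suite_cases:
        if case.get("status") not in {"active", "proposed"}:
            continue
        matched = [
            item_field
            for item_field, case_field in MATCH_FIELDS.items()
            if failure_item.get(item_field)
            and failure_item.get(item_field) == case.get(case_field)
        ]
        if matched:
            candidates.append((case["id"], matched))
    return candidates


# Illustrative data only; IDs and keys are made up.
failure = {"failure_cluster_key": "frc_a", "failure_fingerprint": "frf_1", "case_key": "refund"}
cases = [
    {"id": "case_1", "status": "active", "source_failure_cluster_key": "frc_a", "source_case_key": "refund"},
    {"id": "case_2", "status": "archived", "source_failure_cluster_key": "frc_a"},
    {"id": "case_3", "status": "proposed", "source_failure_cluster_key": "frc_z"},
]
print(find_duplicate_candidates(failure, cases))
```
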
## Verify Suite-Only
Use `eval start` when selectors can be names, slugs, or exact suite names:

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG_OR_EXACT_NAME> \
  --pack-version <VERSION_ID_OR_VERSION_NUMBER> \
  --deployment <DEPLOYMENT_ID_OR_EXACT_NAME> \
  --scope suite_only \
  --suite <SUITE_ID_OR_EXACT_NAME> \
  --follow
```

Use `run create` when automation already has IDs:

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --scope suite_only \
  --suite <SUITE_ID> \
  --case <CASE_ID> \
  --follow
```

Exact suite-only notes:

- `--scope suite_only` requires at least one `--suite` or `--case`.
- In `eval start`, `--suite` can resolve a suite ID or exact suite name; `--case` is a case ID.
- In `run create`, `--suite` and `--case` are ID-first.
- Runs with `--repetitions` set to 2 or more do not support `--scope suite_only`, `--suite`, or `--case`.
- After the run, inspect `agentclash run get <RUN_ID> --json` for `regression_coverage`.

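Scripts that assemble these commands can check the flag combination first. The sketch below encodes only the two constraints stated in the notes; `check_suite_only_args` is a hypothetical helper, and the default of one repetition is an assumption rather than documented CLI behavior.

```python
def check_suite_only_args(scope, suites, cases, repetitions=1):
    """Return violations of the suite-only flag rules (empty list if none)."""
    errors = []
    if scope == "suite_only" and not suites and not cases:
        errors.append("--scope suite_only requires at least one --suite or --case")
    if repetitions >= 2 and (scope == "suite_only" or suites or cases):
        errors.append("--repetitions >= 2 does not support --scope suite_only, --suite, or --case")
    return errors


# Placeholder suite ID for illustration only.
print(check_suite_only_args("suite_only", ["suite_1"], []))
print(check_suite_only_args("suite_only", [], []))
```
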
`regression_coverage` contains:

```json
{
  "regression_coverage": {
    "suites": [
      {
        "id": "<SUITE_ID>",
        "name": "Checkout regressions",
        "case_count": 3,
        "pass_count": 2,
        "fail_count": 1
      }
    ],
    "unmatched_cases": [
      {
        "id": "<CASE_ID>",
        "title": "<case title>",
        "outcome": "fail"
      }
    ]
  }
}
```

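For the report, the coverage block can be reduced to totals once the `run get --json` output is parsed. This is a minimal sketch: `summarize_coverage` is a hypothetical helper, and it assumes `"fail"` is the outcome value that marks a failing unmatched case, since that is the only value shown in the example.

```python
def summarize_coverage(run):
    """Total pass/fail counts across suites and list failing unmatched cases."""
    coverage = run.get("regression_coverage") or {}
    suites = coverage.get("suites") or []
    unmatched = coverage.get("unmatched_cases") or []
    return {
        "case_count": sum(s.get("case_count", 0) for s in suites),
        "pass_count": sum(s.get("pass_count", 0) for s in suites),
        "fail_count": sum(s.get("fail_count", 0) for s in suites),
        "failing_unmatched_case_ids": [
            c.get("id") for c in unmatched if c.get("outcome") == "fail"
        ],
    }


# Same shape as the example above, with placeholder IDs.
sample_run = {
    "regression_coverage": {
        "suites": [{"id": "suite_1", "name": "Checkout regressions",
                    "case_count": 3, "pass_count": 2, "fail_count": 1}],
        "unmatched_cases": [{"id": "case_9", "title": "example", "outcome": "fail"}],
    }
}
print(summarize_coverage(sample_run))
```

A run without `regression_coverage` yields zero counts, which maps to `unavailable` in the report format.
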
Then inspect:

```bash
agentclash run failures <VERIFICATION_RUN_ID> --json
agentclash eval scorecard <VERIFICATION_RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run ranking <VERIFICATION_RUN_ID> --json
```

## Expected Output
- A small set of promoted cases with clear source evidence, status, severity, suite, and promotion mode.
- No duplicate active/proposed cases for the same behavior in the target suite.
- A suite-only verification run ID and result.
- A concise explanation of whether the fix passes, fails, or needs more validation runs.

## Failure Modes
- Missing workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- `source_challenge_pack_id is required`: create the suite with the source challenge pack ID, not a challenge pack version ID.
- `challenge_pack_not_found`: the source challenge pack is not visible to the workspace.
- `regression_suite_name_conflict`: rename the suite or reuse the existing active suite.
- `regression_suite_archived`: reactivate the suite or pick an active one.
- `regression_suite_pack_mismatch`: choose a suite whose `source_challenge_pack_id` matches the run source pack.
- `failure_review_item_not_found`: use the `challenge_identity_id` from `run failures --json`, not the fingerprint or cluster key.
- `failure_review_item_ambiguous`: pass `--run-agent <RUN_AGENT_ID>`.
- `failure_not_promotable`: do not promote; collect better evidence or run with a challenge input set.
- `promotion_mode_unavailable`: choose a mode listed in `promotion_mode_available`.
- `invalid_promotion_overrides`: use only `judge_threshold_overrides` and `assertion_toggles` with the correct map value types.
- `--scope suite_only requires at least one --suite or --case`: add a suite or case selector.

## Safety Notes
- Promotion and suite/case updates mutate shared workspace state. Confirm intent before changing production suites.
- Do not put secrets, customer data, raw artifact contents, or long traces into case titles, summaries, descriptions, metadata, or chat.
- Prefer `status: "proposed"` when a reviewer still needs to approve the case.
- Archive or reject noisy cases instead of leaving weak regressions active.
- Keep suite-only verification focused; avoid broad full-pack reruns when a targeted suite is enough.

## Report Back Format
```text
Run: <RUN_ID>
Failure reviewed:
- challenge_identity_id=<id> run_agent_id=<id> cluster=<frc_...> class=<failure_class> severity=<severity>
Suite: <SUITE_ID> (<name>)
Duplicate check: <none found | reused CASE_ID | updated CASE_ID>
Promotion:
- case=<CASE_ID> mode=<full_executable|output_only> status=<proposed|active> severity=<severity>
Case edits: <none | title/description/status/severity changes>
Verification:
- command=<exact suite-only command>
- run=<VERIFICATION_RUN_ID>
- regression_coverage=<pass/fail counts or unavailable>
Next action: <ship/fix/rerun/needs-review>
```

## Related Skills
- `agentclash-hub`
- `agentclash-cli-setup`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
- `agentclash-compare-and-triage`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/reference/cli`