Agent Skills

Scorecard Reader Skill

Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.

Canonical source: web/content/agent-skills/agentclash-scorecard-reader/SKILL.md

Markdown export: /docs-md/agent-skills/agentclash-scorecard-reader

Full SKILL.md

markdown
---
name: agentclash-scorecard-reader
description: Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.
metadata:
  agentclash.role: reviewing
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Scorecard Reader

## Purpose
Turn completed or inspectable AgentClash run evidence into an engineering readout: who won, why, which claims are backed by scorecard/replay/artifact evidence, and what the next command or fix should be.

## Use When
- A user asks why an AgentClash run passed, failed, regressed, drifted, or picked a winner.
- You have a run ID and need to inspect rankings, run agents, scorecards, replay steps, failure-review items, or artifacts.
- A reviewer needs evidence-first findings instead of raw JSON dumps.
- A follow-up skill needs a grounded summary before promoting regressions or changing a challenge pack.

## Do Not Use When
- The user needs to start a new eval run; use `agentclash-eval-runner`.
- The user needs to author, validate, or publish the challenge pack first; use the challenge-pack skills.
- The user is ready to promote failures into regression suites; use `agentclash-regression-flywheel` after this skill identifies the useful failures.

## Inputs Needed
- Workspace ID or configured workspace context.
- Run ID.
- Optional run agent ID or agent label for a specific scorecard or replay.
- Optional baseline expectation, expected winner, or release-gate decision to compare against.
- Optional artifact IDs from failure evidence, scorecards, or workspace artifact list.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
```

Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`. `run failures`, `artifact list`, and `eval scorecard` require a workspace. `artifact download` uses an artifact ID directly.

## Procedure
1. Confirm the run exists and get its agent IDs.
2. Read the ranking to identify the winner, sort mode, gaps, unavailable scores, and evidence warnings.
3. Read the relevant scorecard. Use `eval scorecard` for run-first analysis and baseline comparison; use `run scorecard` when you already have a run agent ID.
4. Read failure-review items for concrete failed checks, replay refs, artifact refs, judge refs, metric refs, severity, and promotability.
5. Pull replay steps around referenced sequences, or page through replay when no refs exist.
6. Inspect artifacts only when a scorecard, failure item, or user request points to them.
7. Report claims as evidence-first findings: claim, evidence pointer, impact, next action.

## Commands
Start with run-level shape:

```bash
agentclash run get <RUN_ID> --json
agentclash run agents <RUN_ID> --json
agentclash run ranking <RUN_ID> --json
agentclash run ranking <RUN_ID> --sort-by composite --json
agentclash run ranking <RUN_ID> --sort-by correctness --json
```

Scorecard commands:

```bash
agentclash eval scorecard <RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash eval scorecard --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run scorecard <RUN_AGENT_ID> --json
```

Replay and failure-review commands:

```bash
agentclash run failures <RUN_ID> --json
agentclash run failures <RUN_ID> --agent <RUN_AGENT_ID> --severity blocking --json
agentclash run failures <RUN_ID> --class policy_violation --evidence-tier hosted_structured --json
agentclash run failures <RUN_ID> --cluster <FAILURE_CLUSTER_KEY> --limit 50 --json
agentclash replay get <RUN_AGENT_ID> --limit 50 --json
agentclash replay get <RUN_AGENT_ID> --cursor 50 --limit 50 --json
```

Artifact commands:

```bash
agentclash artifact list --json
agentclash artifact download <ARTIFACT_ID> --output <PATH>
```

Important exact CLI shapes:

- `run scorecard` takes one argument: `<RUN_AGENT_ID>`. It does not accept `<RUN_ID> <RUN_AGENT_ID>`.
- `replay get` takes one argument: `<RUN_AGENT_ID>`. It does not accept `<RUN_ID> <RUN_AGENT_ID>`.
- `artifact list` is workspace-wide. It does not have a `--run` filter today; use `artifact_refs`, `run_id`, or `run_agent_id` fields from JSON output to choose what to download.
- `run ranking --sort-by` commonly uses `composite`, `correctness`, `reliability`, `latency`, or `cost`. The backend also accepts a custom dimension key when that key exists in the scorecard dimensions; unknown keys return `invalid_sort_by`.
- `run failures --limit` defaults to 50 when omitted and is capped at 200 by the API.

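The `--sort-by` rule above can be pre-checked before calling the CLI. The following is a minimal sketch, not part of AgentClash: `is_plausible_sort_key` is a hypothetical helper, and it assumes the five keys listed above are the only built-in ones and that custom keys are the keys of the scorecard's `dimensions` object. The backend remains the authority and may still return `invalid_sort_by`.

```python
# Hypothetical client-side pre-check; the backend still decides what is valid.
BUILT_IN_SORT_KEYS = {"composite", "correctness", "reliability", "latency", "cost"}


def is_plausible_sort_key(sort_by: str, scorecard_dimensions: dict) -> bool:
    """Return True when sort_by is a listed built-in key or a known dimension key."""
    return sort_by in BUILT_IN_SORT_KEYS or sort_by in scorecard_dimensions


# Illustrative dimensions object; "safety" is a made-up custom dimension key.
dimensions = {"correctness": {"state": "available", "score": 0.95}, "safety": {}}
for key in ("latency", "safety", "speed"):
    print(key, is_plausible_sort_key(key, dimensions))
```
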
## Ranking JSON
`agentclash run ranking <RUN_ID> --json` returns a stateful response:

```json
{
  "state": "ready",
  "ranking": {
    "run_id": "<RUN_ID>",
    "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
    "sort": {
      "field": "correctness_then_reliability",
      "direction": "desc",
      "default_order": true
    },
    "winner": {
      "run_agent_id": "<RUN_AGENT_ID>",
      "strategy": "<strategy>",
      "status": "<status>",
      "reason_code": "<reason_code>"
    },
    "evidence_quality": {
      "missing_fields": [],
      "warnings": []
    },
    "items": [
      {
        "rank": 1,
        "run_agent_id": "<RUN_AGENT_ID>",
        "lane_index": 0,
        "label": "<agent label>",
        "status": "completed",
        "has_scorecard": true,
        "evaluation_status": "complete",
        "sort_value": 0.92,
        "delta_from_top": 0,
        "sort_state": "available",
        "strategy": "<strategy>",
        "passed": true,
        "overall_reason": "<reason>",
        "composite_score": 0.91,
        "overall_score": 0.91,
        "correctness_score": 0.95,
        "reliability_score": 0.9,
        "latency_score": 0.8,
        "cost_score": 0.7,
        "dimensions": {
          "correctness": {
            "state": "available",
            "score": 0.95,
            "better_direction": "higher"
          }
        }
      }
    ]
  }
}
```

Read `evidence_quality.warnings` before declaring a winner as conclusive. A low `sort_value`, missing `rank`, `sort_state: "unavailable"`, `has_scorecard: false`, or missing score fields means the ranking may be partial even if the run itself completed.

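The checks above can be gathered into one pass over the ranking payload. This is a sketch under assumptions: `ranking_caveats` is a hypothetical helper, it only reads fields shown in the sample response, and it deliberately skips the "low `sort_value`" check because no threshold is defined here.

```python
def ranking_caveats(payload: dict) -> list[str]:
    """Collect reasons a ranking should be reported as partial rather than conclusive."""
    if payload.get("state") != "ready":
        return [f"ranking state is {payload.get('state')!r}"]
    ranking = payload.get("ranking", {})
    quality = ranking.get("evidence_quality", {})
    caveats = [f"warning: {w}" for w in quality.get("warnings", [])]
    caveats += [f"missing field: {f}" for f in quality.get("missing_fields", [])]
    for item in ranking.get("items", []):
        agent = item.get("run_agent_id", "<unknown>")
        if item.get("rank") is None:
            caveats.append(f"{agent}: no rank")
        if item.get("sort_state") == "unavailable":
            caveats.append(f"{agent}: sort value unavailable")
        if not item.get("has_scorecard", False):
            caveats.append(f"{agent}: no scorecard")
    return caveats


# Illustrative payload with one fully scored agent and one unscored agent.
sample = {
    "state": "ready",
    "ranking": {
        "evidence_quality": {"missing_fields": [], "warnings": []},
        "items": [
            {"rank": 1, "run_agent_id": "agent-a", "sort_state": "available", "has_scorecard": True},
            {"run_agent_id": "agent-b", "sort_state": "unavailable", "has_scorecard": False},
        ],
    },
}
for caveat in ranking_caveats(sample):
    print(caveat)
```

An empty result does not prove the winner is correct; it only means none of the listed partial-ranking signals were present.
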
## Scorecard JSON
Use `agentclash run scorecard <RUN_AGENT_ID> --json` when you already know the agent ID. The top level mirrors `/v1/scorecards/{runAgentID}`:

```json
{
  "state": "ready",
  "run_agent_status": "completed",
  "id": "<SCORECARD_ID>",
  "run_agent_id": "<RUN_AGENT_ID>",
  "run_id": "<RUN_ID>",
  "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
  "overall_score": 0.91,
  "correctness_score": 0.95,
  "reliability_score": 0.9,
  "latency_score": 0.8,
  "cost_score": 0.7,
  "behavioral_score": 0.85,
  "llm_judge_results": [
    {
      "id": "<JUDGE_RESULT_ID>",
      "judge_key": "<judge_key>",
      "mode": "<mode>",
      "normalized_score": 0.8,
      "confidence": "medium",
      "variance": 0.02,
      "sample_count": 3,
      "model_count": 1,
      "payload": {},
      "created_at": "<timestamp>",
      "updated_at": "<timestamp>"
    }
  ],
  "scorecard": {
    "run_agent_id": "<RUN_AGENT_ID>",
    "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
    "status": "complete",
    "strategy": "<strategy>",
    "overall_score": 0.91,
    "passed": true,
    "overall_reason": "<reason>",
    "warnings": [],
    "dimensions": {
      "correctness": {
        "state": "available",
        "score": 0.95,
        "reason": "<reason>",
        "weight": 1,
        "gate": true,
        "pass_threshold": 0.8,
        "contribution": 0.95,
        "gate_passed": true
      }
    },
    "validator_summary": {},
    "validator_details": [
      {
        "key": "<validator_key>",
        "type": "<validator_type>",
        "verdict": "pass",
        "state": "complete",
        "reason": "<reason>",
        "normalized_score": 1,
        "source": {
          "kind": "final_output",
          "sequence": 12,
          "event_type": "<event type>",
          "field_path": "<field path>"
        }
      }
    ],
    "metric_summary": {},
    "metric_details": [
      {
        "key": "<metric_key>",
        "collector": "<collector>",
        "state": "available",
        "numeric_value": 123
      }
    ]
  },
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

Read the nested `scorecard.dimensions` first, then inspect `validator_details`, `metric_details`, and `llm_judge_results` for supporting evidence. Treat `llm_judge_results.payload` as judge-specific raw data; do not invent a stable schema inside it.

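The reading order above (dimensions first, then validators) might be sketched as two small helpers. Both function names are hypothetical. The sketch assumes `"pass"` is the only passing `verdict` value, which is the only value the sample shows, and it never looks inside `llm_judge_results.payload`.

```python
def failed_gates(response: dict) -> list[str]:
    """Names of gated dimensions whose gate did not pass."""
    dims = response.get("scorecard", {}).get("dimensions", {})
    return sorted(
        name for name, dim in dims.items()
        if dim.get("gate") and dim.get("gate_passed") is False
    )


def non_passing_validators(response: dict) -> list[tuple]:
    """(validator key, source sequence) pairs for validators without a "pass" verdict."""
    details = response.get("scorecard", {}).get("validator_details", [])
    return [
        (v.get("key"), v.get("source", {}).get("sequence"))
        for v in details
        if v.get("verdict") != "pass"  # assumption: "pass" is the only passing verdict
    ]


# Illustrative response trimmed to the fields the helpers read.
sample = {
    "scorecard": {
        "dimensions": {
            "correctness": {"gate": True, "gate_passed": False, "score": 0.4},
            "latency": {"gate": False, "score": 0.8},
        },
        "validator_details": [
            {"key": "exact_match", "verdict": "fail", "source": {"sequence": 12}},
            {"key": "schema", "verdict": "pass", "source": {"sequence": 12}},
        ],
    }
}
print(failed_gates(sample))
print(non_passing_validators(sample))
```

The returned sequence numbers are the ones to look up in replay before writing a finding.
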
## Eval Scorecard Envelope
`agentclash eval scorecard [RUN_ID] --agent <RUN_AGENT_ID_OR_LABEL> --json` is run-first. If `RUN_ID` is omitted, the CLI selects the latest run in the workspace. If the run has multiple agents in non-interactive mode, pass `--agent`.

Structured output is an envelope:

```json
{
  "candidate": {
    "workspace_id": "<WORKSPACE_ID>",
    "run_id": "<RUN_ID>",
    "run_name": "<name>",
    "run_status": "completed",
    "run_agent_id": "<RUN_AGENT_ID>",
    "run_agent_label": "<label>",
    "official_pack_mode": "full"
  },
  "baseline": null,
  "scorecard": {},
  "comparison": null,
  "release_gate": null
}
```

When a baseline bookmark exists, the CLI also fetches `/v1/compare` and `/v1/release-gates/evaluate`, then fills `baseline`, `comparison`, and `release_gate`. Use this envelope for regression-style summaries; use `run scorecard` for the raw per-agent scorecard only.

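Before writing a regression-style summary, it helps to check which parts of the envelope were actually filled. A minimal sketch, assuming only the top-level keys shown above; `envelope_summary` is a hypothetical helper and does not interpret the contents of `comparison` or `release_gate`, whose schemas are not described here.

```python
def envelope_summary(envelope: dict) -> str:
    """Say whether the envelope supports a baseline comparison or only a raw scorecard."""
    candidate = envelope.get("candidate", {})
    who = f"{candidate.get('run_id')}/{candidate.get('run_agent_label')}"
    if envelope.get("baseline") is None:
        return f"{who}: no baseline; scorecard only"
    filled = [name for name in ("comparison", "release_gate") if envelope.get(name) is not None]
    return f"{who}: baseline present; available: {', '.join(filled) or 'none'}"


# Illustrative envelope without a baseline bookmark.
print(envelope_summary({
    "candidate": {"run_id": "run-1", "run_agent_label": "agent-a"},
    "baseline": None,
    "scorecard": {},
    "comparison": None,
    "release_gate": None,
}))
```
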
## Replay JSON
`agentclash replay get <RUN_AGENT_ID> --json` returns replay state, optional replay metadata, steps, and pagination:

```json
{
  "state": "ready",
  "run_agent_id": "<RUN_AGENT_ID>",
  "run_id": "<RUN_ID>",
  "run_agent_status": "completed",
  "replay": {
    "id": "<REPLAY_ID>",
    "artifact_id": "<ARTIFACT_ID>",
    "summary": {},
    "latest_sequence_number": 12,
    "event_count": 42,
    "created_at": "<timestamp>",
    "updated_at": "<timestamp>"
  },
  "steps": [
    {
      "sequence_number": 12,
      "step_type": "<step type>",
      "summary": "<summary>"
    }
  ],
  "pagination": {
    "next_cursor": "50",
    "limit": 50,
    "total_steps": 120,
    "has_more": true
  }
}
```

Use replay to verify whether scorecard and judge claims match observable behavior. Prefer referenced `sequence_number` values from failure items or validator sources before paging through the whole replay.

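One way to focus on referenced sequences is to keep only the steps near them. This is a sketch: `steps_near` is a hypothetical helper, and the `radius` of two steps on each side is an arbitrary choice, not an AgentClash convention.

```python
def steps_near(steps: list[dict], refs: list[int], radius: int = 2) -> list[dict]:
    """Keep steps whose sequence_number is within `radius` of any referenced sequence."""
    return [
        step for step in steps
        if any(abs(step["sequence_number"] - ref) <= radius for ref in refs)
    ]


# Illustrative replay page with 20 steps and one referenced sequence number.
steps = [{"sequence_number": n, "step_type": "example"} for n in range(1, 21)]
window = steps_near(steps, refs=[12])
print([step["sequence_number"] for step in window])
```

The references would come from `replay_step_refs[].sequence_number` on failure items or `source.sequence` on validator details.
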
## Failure Review JSON
`agentclash run failures <RUN_ID> --json` returns:

```json
{
  "items": [
    {
      "run_id": "<RUN_ID>",
      "run_agent_id": "<RUN_AGENT_ID>",
      "challenge_identity_id": "<CHALLENGE_ID>",
      "challenge_key": "<challenge_key>",
      "case_key": "<case_key>",
      "item_key": "<item_key>",
      "failure_fingerprint": "frf_...",
      "failure_cluster_key": "frc_...",
      "failure_state": "failed",
      "failed_dimensions": ["correctness"],
      "failed_checks": ["<validator_or_judge_key>"],
      "failure_class": "policy_violation",
      "headline": "<headline>",
      "detail": "<detail>",
      "recommended_action": "<recommended action>",
      "promotable": true,
      "promotion_mode_available": ["full_executable", "output_only"],
      "replay_step_refs": [
        {
          "sequence_number": 12,
          "event_type": "<event type>",
          "kind": "<kind>"
        }
      ],
      "artifact_refs": [
        {
          "key": "<artifact key>",
          "kind": "<kind>",
          "path": "<path>",
          "media_type": "<media type>"
        }
      ],
      "judge_refs": [
        {
          "key": "<judge key>",
          "kind": "llm_judge",
          "state": "fail",
          "normalized_score": 0.2,
          "reason": "<reason>",
          "sequence_number": 12,
          "event_type": "<event type>"
        }
      ],
      "metric_refs": [
        {
          "key": "<metric key>",
          "metric_type": "<type>",
          "state": "available",
          "numeric_value": 123
        }
      ],
      "evidence_tier": "hosted_structured",
      "severity": "blocking"
    }
  ],
  "clusters": [
    {
      "failure_cluster_key": "frc_...",
      "representative_failure_fingerprint": "frf_...",
      "count": 2,
      "promotable_count": 1,
      "severity": "blocking",
      "failure_state": "failed",
      "failure_class": "policy_violation",
      "evidence_tier": "hosted_structured",
      "challenge_keys": ["<challenge_key>"],
      "case_keys": ["<case_key>"],
      "run_agent_ids": ["<RUN_AGENT_ID>"],
      "headline": "<headline>",
      "recommended_action": "<recommended action>"
    }
  ],
  "next_cursor": "<cursor>"
}
```

Filters supported by the CLI:

- `--agent <RUN_AGENT_ID>`
- `--severity info|warning|blocking`
- `--class <failure_class>`
- `--evidence-tier none|native_structured|hosted_structured|hosted_black_box|derived_summary`
- `--cluster <FAILURE_CLUSTER_KEY>`
- `--cursor <NEXT_CURSOR>`
- `--limit <COUNT>`

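The `--cursor` filter pairs with `next_cursor` in the response. A cursor loop could look like the sketch below, under two assumptions that are not confirmed here: an absent or empty `next_cursor` marks the last page, and `fetch_page` is a hypothetical callable you supply (for example a wrapper that runs `agentclash run failures <RUN_ID> --cursor <NEXT_CURSOR> --json` and parses the output).

```python
from typing import Callable, Iterator, Optional


def iter_failure_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield failure items page by page until no next_cursor is returned."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:  # assumption: missing or empty cursor means last page
            break


# Stand-in pages keyed by cursor, used here instead of real CLI calls.
fake_pages = {
    None: {"items": [{"item_key": "a"}, {"item_key": "b"}], "next_cursor": "c1"},
    "c1": {"items": [{"item_key": "c"}]},
}
print([item["item_key"] for item in iter_failure_items(lambda cursor: fake_pages[cursor])])
```

Injecting `fetch_page` keeps the loop testable without network access and keeps the CLI invocation in one place.
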
Failure classes currently accepted by the API are `incorrect_final_output`, `tool_selection_error`, `tool_argument_error`, `retrieval_grounding_failure`, `policy_violation`, `timeout_or_budget_exhaustion`, `sandbox_failure`, `dependency_resolution_failure`, `malformed_output`, `flaky_non_deterministic`, `insufficient_evidence`, and `other`.

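To hand useful failures to a follow-up skill, the items can be grouped by cluster. The sketch below uses only fields from the sample response; `promotable_blocking_by_cluster` is a hypothetical helper, and choosing `blocking` plus `promotable` as the filter is one reasonable triage rule, not a documented one.

```python
from collections import defaultdict


def promotable_blocking_by_cluster(payload: dict) -> dict:
    """Map failure_cluster_key to fingerprints of blocking, promotable failures."""
    grouped = defaultdict(list)
    for item in payload.get("items", []):
        if item.get("severity") == "blocking" and item.get("promotable"):
            grouped[item["failure_cluster_key"]].append(item["failure_fingerprint"])
    return dict(grouped)


# Illustrative items; the keys and fingerprints are made up.
sample = {
    "items": [
        {"failure_cluster_key": "frc_1", "failure_fingerprint": "frf_a", "severity": "blocking", "promotable": True},
        {"failure_cluster_key": "frc_1", "failure_fingerprint": "frf_b", "severity": "warning", "promotable": True},
        {"failure_cluster_key": "frc_2", "failure_fingerprint": "frf_c", "severity": "blocking", "promotable": False},
    ]
}
print(promotable_blocking_by_cluster(sample))
```
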
## Stateful Reads and Exit Codes
Ranking, scorecard, and replay reads can be ready, pending, or errored.

- Pending responses use HTTP 202 and include `state: "pending"` plus `message`. The CLI prints the raw payload in structured mode and exits successfully.
- Errored responses use HTTP 409 and include `state: "errored"` plus `message`. The CLI prints the raw payload in structured mode and exits with code 1.
- Common messages include `ranking is not ready yet`, `scorecard generation is pending`, `scorecard generation failed or scorecard data is unavailable`, and `replay generation is pending`.

When a read is pending, do not infer pass/fail. Re-check later or inspect `run get`, `run agents`, and `run events`. When a read is errored, report the state and switch to available run, failure, event, and artifact evidence.

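Because a pending read also exits successfully, the exit code alone cannot tell a ready payload from a pending one; the `state` field has to be checked. A minimal sketch of that branch, where `next_step_for_read` is a hypothetical helper and the returned strings are illustrative:

```python
def next_step_for_read(payload: dict) -> str:
    """Decide what to do with a ranking, scorecard, or replay payload based on its state."""
    state = payload.get("state")
    message = payload.get("message", "")
    if state == "ready":
        return "interpret"
    if state == "pending":
        return f"recheck later: {message}"
    if state == "errored":
        return f"report errored and use other evidence: {message}"
    return f"unexpected state: {state!r}"


print(next_step_for_read({"state": "pending", "message": "ranking is not ready yet"}))
```
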
## Evidence-First Interpretation
Use this ordering when writing findings:

1. Ranking: winner, sort field, score gap, evidence warnings, unavailable scores.
2. Scorecard dimensions: failed gates, low scores, unavailable/error dimensions, `overall_reason`, warnings.
3. Validator and metric details: exact failed checks, source refs, numeric values, reasons.
4. LLM judge results: judge key, mode, normalized score, confidence, variance, sample/model counts, concise rationale if present in payload.
5. Failure review: `failure_state`, `failure_class`, `severity`, `evidence_tier`, refs, `recommended_action`, promotability.
6. Replay: sequence refs and behavior observed around final output, tool calls, sandbox failures, or malformed outputs.
7. Artifacts: downloaded only when needed to verify file output or user-visible content.

Do not say "the judge proved" something. Say "the scorecard/judge reports X, and replay/artifact evidence Y supports or contradicts it."

## Expected Output
- A winner or status summary that names the run and agent IDs.
- A short list of findings with evidence pointers to scorecard fields, failure fingerprints or clusters, replay sequence numbers, and artifact IDs or paths.
- A distinction between confirmed evidence, judge rationale, unavailable data, and pending/errored reads.
- Follow-up commands a reviewer can run exactly.

## Failure Modes
- Missing workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- Multiple run agents for `eval scorecard`: pass `--agent <RUN_AGENT_ID_OR_LABEL>`.
- Ranking pending: wait for scoring or inspect `run get`, `run agents`, and `run events`.
- Scorecard pending: wait for scorecard generation; do not report pass/fail yet.
- Scorecard errored: the run agent may have failed, or scorecard data may be unavailable. Use failures, replay, events, and artifacts instead.
- Replay pending or noisy: use failure `replay_step_refs` or validator `source.sequence` before paging broadly.
- `artifact list --run` fails: the `--run` flag does not exist. Use workspace-wide `artifact list --json` or artifact refs from failure/scorecard evidence.
- Unknown `--sort-by`: use a built-in sort field or a dimension key that exists in the scorecard.

## Safety Notes
- Do not paste secrets, private artifact contents, raw provider keys, customer data, or long logs into chat.
- Do not overstate LLM judge rationale as ground truth.
- Download artifacts only when needed, and prefer targeted artifact IDs from evidence refs.
- Scorecards and failures can contain model outputs and tool traces. Quote only the minimum needed to support the finding.
- Read commands are safe, but follow-up mutation commands such as failure promotion belong to `agentclash-regression-flywheel` and should be intentional.

## Report Back Format
```text
Outcome: <winner/status/pending/errored>
Run: <RUN_ID>
Agent(s): <RUN_AGENT_ID or labels>
Evidence:
- <claim> | scorecard=<field/path> | replay=<sequence or none> | artifact=<id/path or none>
Findings:
- <impact> -> <next action>
Uncertainties:
- <pending/errored/unavailable evidence, or none>
Follow-up commands:
- agentclash run ranking <RUN_ID> --json
- agentclash run failures <RUN_ID> --json
- agentclash run scorecard <RUN_AGENT_ID> --json
- agentclash replay get <RUN_AGENT_ID> --limit 50 --json
Next skill: <agentclash-regression-flywheel | challenge-pack skill | none>
```

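If evidence lines are assembled programmatically, a small formatter keeps them consistent with the template above. This is only a sketch: `evidence_line` is a hypothetical helper, and the claim and field path in the example are illustrative.

```python
from typing import Optional


def evidence_line(
    claim: str,
    scorecard_path: str,
    replay_sequence: Optional[int] = None,
    artifact: Optional[str] = None,
) -> str:
    """Render one Evidence line, writing "none" for pointers that are not available."""
    replay = replay_sequence if replay_sequence is not None else "none"
    return f"- {claim} | scorecard={scorecard_path} | replay={replay} | artifact={artifact or 'none'}"


print(evidence_line(
    "correctness gate failed",
    "scorecard.dimensions.correctness.gate_passed",
    replay_sequence=12,
))
```
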
## Related Skills
- `agentclash-hub`
- `agentclash-cli-setup`
- `agentclash-eval-runner`
- `agentclash-compare-and-triage`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/concepts/artifacts`
- `/docs-md/reference/cli`