Eval Runner Skill
Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
Canonical source: web/content/agent-skills/agentclash-eval-runner/SKILL.md
Markdown export: /docs-md/agent-skills/agentclash-eval-runner
Full SKILL.md
---
name: agentclash-eval-runner
description: Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
metadata:
  agentclash.role: running
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Eval Runner

## Purpose
Create an AgentClash eval run or eval session against a published challenge pack, follow it when useful, inspect evidence after it runs, and report stable commands a reviewer can repeat.

## Use When
- A user asks to run one or more agent deployments against a published challenge pack.
- A user wants to choose a challenge pack, version, input set, deployment, regression suite, or run scope from the CLI.
- A user wants live run events, rankings, failures, agents, or scorecards after a run starts.
- A CI or local workflow needs exact non-interactive commands.

## Do Not Use When
- The challenge pack is not authored or published yet; use the challenge-pack skills first.
- The user needs to create deployments or runtime resources; use `agentclash-agent-deployment-setup` or `agentclash-runtime-resources-setup`.
- The task is only to interpret an already generated scorecard in depth; use `agentclash-scorecard-reader`.
- The task is a release gate or CI manifest workflow; use `agentclash-ci-release-gate`.

## Inputs Needed
- Workspace ID or configured default workspace.
- Challenge pack selector: pack ID, slug, exact name, or challenge pack version ID.
- Challenge pack version selector: version ID or version number.
- Input set selector: input set ID, key, or exact name.
- Agent deployment selectors: deployment IDs or exact names.
- Scope: `full` or `suite_only`.
- Optional regression suite IDs/names or regression case IDs.
- Whether to stream events with `--follow`.
- Whether this is a repeated eval session with `--repetitions`.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

Before creating a run, verify auth and workspace context:

```bash
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
agentclash challenge-pack list --json
agentclash deployment list --json
```

Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`. `eval start`, `run create`, `run list`, `run failures`, and `eval scorecard` require a workspace.

## Prefer `eval start` for Humans and Agents
`agentclash eval start` wraps `agentclash run create` but resolves selectors through workspace reads. Use it when names, slugs, input-set keys, or guided selection are useful.

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG_OR_EXACT_NAME> \
  --pack-version <VERSION_ID_OR_VERSION_NUMBER> \
  --input-set <INPUT_SET_ID_OR_KEY_OR_EXACT_NAME> \
  --deployment <DEPLOYMENT_ID_OR_EXACT_NAME> \
  --name "Smoke eval" \
  --follow
```

Exact `eval start` flags:

- `--pack`: challenge pack ID, slug, or exact name.
- `--pack-version`: challenge pack version ID or version number. Use `--pack` when selecting by version number; a version ID can identify the pack by itself.
- `--input-set`: challenge input set ID, key, or exact name.
- `--deployment`: deployment ID or exact name. Repeat this flag for multiple deployments.
- `--name`: optional run name.
- `--follow`: stream run events after creation.
- `--scope`: `full` or `suite_only`; default is `full`.
- `--suite`: regression suite ID or exact name. Repeatable.
- `--case`: regression case IDs. Repeatable.
- `--race-context`: enable live peer-standings injection during the run.
- `--race-context-cadence`: 0 for backend default, otherwise 1 through 10.
- `--repetitions`: repeat the eval 1 through 100 times; values 2 or greater use `/v1/eval-sessions`.

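A wrapper script may want to reject out-of-range values before it spends time resolving selectors. The following is a minimal sketch of the two numeric ranges listed above, written for a POSIX-style shell. The helper names `is_uint`, `validate_race_context_cadence`, and `validate_repetitions` are hypothetical and not part of the AgentClash CLI, which still applies its own checks.

```bash
# Hypothetical pre-flight helpers; they are not part of the agentclash CLI.

# Succeeds when the argument is a plain non-negative integer.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# --race-context-cadence: 0 (backend default) or 1 through 10.
validate_race_context_cadence() {
  is_uint "$1" && [ "$1" -le 10 ]
}

# --repetitions: 1 through 100.
validate_repetitions() {
  is_uint "$1" && [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}
```

For example, `validate_repetitions "$REPS" || exit 2` could guard an `eval start` call in a CI script, where `REPS` is a variable the script defines.
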
Selector behavior:

- Pack selectors match ID, slug, or exact name.
- Deployment selectors match ID or exact name.
- Suite selectors match ID or exact name, are filtered to the selected pack when possible, and must resolve to active suites.
- Input set selectors match ID, input key, or exact name.
- Selectors are exact or case-insensitive exact matches, not substring search.
- If no pack is specified and there is one pack, the CLI uses it; with multiple packs in non-interactive mode, pass `--pack` or `--pack-version`.
- If no version is specified, the CLI uses the highest `version_number` for the selected pack.
- If a version has no input sets, the CLI submits without `challenge_input_set_id`.
- If a version has one input set, the CLI uses it.
- If a version has multiple input sets in non-interactive mode, pass `--input-set`.
- If multiple deployments exist in non-interactive mode, pass at least one `--deployment`.

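The difference between a case-insensitive exact match and a substring search is easy to get wrong when predicting what a selector will resolve to. The sketch below illustrates the matching rule only; `selector_matches` is a hypothetical helper, the sample names are made up, and the CLI's actual resolution logic may differ in details not described here.

```bash
# Hypothetical helper: case-insensitive exact match, never a substring match.
# Usage: selector_matches <selector> <candidate>
selector_matches() {
  local selector candidate
  # Lowercase both sides so "Smoke-Pack" and "smoke-pack" compare equal.
  selector=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  candidate=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ "$selector" = "$candidate" ]
}
```

Under this rule `selector_matches "smoke-pack" "Smoke-Pack"` succeeds, while `selector_matches "smoke" "smoke-pack"` fails because a prefix is not an exact match.
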
## Use `run create` for ID-First Automation
`agentclash run create` posts directly to `/v1/runs`. Use it when a script already has IDs.

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --input-set <CHALLENGE_INPUT_SET_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --name "Smoke eval" \
  --follow
```

Exact `run create` notes:

- The lower-level flag is plural `--deployments`; it expects deployment IDs.
- `--challenge-pack-version` expects a challenge pack version ID.
- `--input-set` expects a challenge input set ID.
- In non-interactive mode, `--challenge-pack-version` and `--deployments` are required.
- In a TTY, missing challenge pack version, input set, or deployments can open pickers.
- `run create` does not resolve pack slugs, input set keys, or deployment names. Use `eval start` for that.
- `--scope`, `--suite`, `--case`, `--race-context`, and `--race-context-cadence` behave like `eval start`, but suite and case flags are ID-first.

The run create request body sent by the CLI contains:

```json
{
  "workspace_id": "<WORKSPACE_ID>",
  "challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "challenge_input_set_id": "<CHALLENGE_INPUT_SET_ID>",
  "agent_deployment_ids": ["<AGENT_DEPLOYMENT_ID>"],
  "official_pack_mode": "full",
  "name": "Smoke eval",
  "regression_suite_ids": ["<REGRESSION_SUITE_ID>"],
  "regression_case_ids": ["<REGRESSION_CASE_ID>"],
  "race_context": true,
  "race_context_min_step_gap": 3
}
```

Optional fields are omitted when not set. The create-run API requires JSON, caps the body at 1 MiB, rejects unknown JSON fields, and returns:

```json
{
  "id": "<RUN_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "challenge_input_set_id": "<CHALLENGE_INPUT_SET_ID>",
  "official_pack_mode": "full",
  "status": "queued",
  "execution_mode": "single_agent",
  "created_at": "<timestamp>",
  "queued_at": "<timestamp>",
  "race_context": false,
  "links": {
    "self": "/v1/runs/<RUN_ID>",
    "agents": "/v1/runs/<RUN_ID>/agents"
  }
}
```

## Suite-Only Runs
Use suite-only scope when you want to run only selected regression suites or cases.

With `eval start`, suites can be IDs or exact names:

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG> \
  --pack-version <VERSION_ID_OR_NUMBER> \
  --deployment <DEPLOYMENT_ID_OR_NAME> \
  --scope suite_only \
  --suite <REGRESSION_SUITE_ID_OR_EXACT_NAME> \
  --follow
```

With `run create`, use IDs:

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --scope suite_only \
  --suite <REGRESSION_SUITE_ID> \
  --follow
```

`--scope suite_only` requires at least one `--suite` or `--case`.

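A script that assembles flags dynamically can check the suite-only requirement before calling the CLI. This is a sketch under stated assumptions: `check_scope` is a hypothetical helper that takes the scope plus the number of `--suite` and `--case` values the script intends to pass, and the "unknown scope" message is illustrative wording rather than CLI output.

```bash
# Hypothetical pre-flight check; not part of the agentclash CLI.
# Usage: check_scope <scope> <suite_count> <case_count>
check_scope() {
  local scope="$1" suites="$2" cases="$3"
  case "$scope" in
    full)
      return 0 ;;
    suite_only)
      # suite_only needs at least one suite or case selection.
      if [ "$suites" -eq 0 ] && [ "$cases" -eq 0 ]; then
        echo "--scope suite_only requires at least one --suite or --case" >&2
        return 1
      fi
      return 0 ;;
    *)
      echo "unknown scope: $scope" >&2
      return 1 ;;
  esac
}
```
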
## Repeated Eval Sessions
Use `--repetitions` on `eval start` for repeated runs of the same eval.

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG> \
  --pack-version <VERSION_ID_OR_NUMBER> \
  --input-set <INPUT_SET_ID_OR_KEY> \
  --deployment <DEPLOYMENT_ID_OR_NAME> \
  --repetitions 3 \
  --json
```

Exact repetition behavior:

- `--repetitions` must be between 1 and 100.
- `--repetitions 1` creates a normal run through `/v1/runs`.
- `--repetitions >= 2` posts to `/v1/eval-sessions`.
- `--follow` is not supported with `--repetitions >= 2`; tail individual child runs with `agentclash run events <RUN_ID>`.
- `--scope suite_only`, `--suite`, `--case`, and race-context flags are not supported with `--repetitions >= 2`.
- The eval-session response is `{ "eval_session": {...}, "run_ids": [...] }`.

In human output, the CLI prints eval session ID, status, repetitions, and child run IDs. In structured output, it prints the raw response envelope.

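The repetition rules determine both the endpoint and which flags are allowed. The sketch below encodes a subset of them (range, `--follow`, and `--scope suite_only`) so a script can predict the outcome before calling the CLI. `eval_endpoint` is a hypothetical helper; apart from the `--follow` message, which is quoted from this skill, the error wording is illustrative.

```bash
# Hypothetical planner; not part of the agentclash CLI.
# Usage: eval_endpoint <repetitions> <follow: yes|no> <scope: full|suite_only>
# Prints the endpoint the request would use, or fails with a reason on stderr.
eval_endpoint() {
  local reps="$1" follow="$2" scope="$3"
  case "$reps" in
    ''|*[!0-9]*) echo "--repetitions must be an integer" >&2; return 1 ;;
  esac
  if [ "$reps" -lt 1 ] || [ "$reps" -gt 100 ]; then
    echo "--repetitions must be between 1 and 100" >&2
    return 1
  fi
  # A single repetition is a normal run.
  if [ "$reps" -eq 1 ]; then
    echo "/v1/runs"
    return 0
  fi
  if [ "$follow" = "yes" ]; then
    echo "--follow is not supported with --repetitions >= 2" >&2
    return 1
  fi
  if [ "$scope" = "suite_only" ]; then
    echo "--scope suite_only is not supported with --repetitions >= 2" >&2
    return 1
  fi
  echo "/v1/eval-sessions"
}
```
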
## Eval Session Commands
When an eval session already exists (from `--repetitions >= 2` or the API), inspect and follow aggregation with:

```bash
agentclash eval session list
agentclash eval session list --limit 20 --offset 0
agentclash eval session get <EVAL_SESSION_ID>
agentclash eval session follow <EVAL_SESSION_ID>
agentclash eval session follow <EVAL_SESSION_ID> --poll-interval 5s --timeout 30m
```

Behavior:

- `eval session list`: paginated workspace eval sessions (`--limit` 1–100, `--offset` ≥ 0).
- `eval session get`: session detail plus aggregate metrics when available.
- `eval session follow`: polls until aggregation finishes; `--timeout 0` disables the timeout.
- All three require workspace context, like other eval commands.

Use `eval session follow` after creating a multi-repetition eval when you need aggregated results before reading scorecards or comparisons.

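`eval session follow` already handles polling, so most workflows need nothing more. For scripts that want the same poll-until-done pattern around a different check, the loop can be sketched as follows. This is an illustration of the pattern, not the CLI's implementation: `poll_until` is a hypothetical helper, it works in whole seconds, and it counts only sleep time toward the timeout.

```bash
# Hypothetical helper: rerun a check command until it succeeds.
# A timeout of 0 disables the deadline, mirroring `--timeout 0`.
# Usage: poll_until <interval_seconds> <timeout_seconds> <command> [args...]
poll_until() {
  local interval="$1" timeout="$2" elapsed=0
  shift 2
  while ! "$@"; do
    if [ "$timeout" -gt 0 ] && [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}
```

The check command is left to the caller, for example a function that reads `agentclash eval session get <EVAL_SESSION_ID> --json` and decides whether the session is finished.
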
## Run Series Commands
`run series` crosses deployment lineups with seeds for race-style series evals:

```bash
agentclash run series create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployment-lineups <LINEUP_ID> \
  --seeds 3 \
  --name "Series smoke"

agentclash run series report <EVAL_SESSION_ID>
```

`run series create` flags:

- `--challenge-pack-version`: required pack version ID.
- `--input-set`: optional challenge input set ID.
- `--deployment-lineups`: repeatable lineup IDs crossed with `--seeds`.
- `--seeds`: integer 1–100 per lineup.
- `--name`: optional series name.
- `--max-iter`: optional per-child-run iteration override (1–1000).

`run series create` posts to `/v1/eval-sessions` and returns an eval session plus child run IDs (same family as `eval start --repetitions`).

`run series report <eval-session-id>` reads `/v1/eval-sessions/<id>` and prints aggregate score, correctness, and cost for the series.

## Follow and Events
Use `--follow` for interactive runs when you want immediate event visibility.

```bash
agentclash eval start ... --follow
agentclash run create ... --follow
agentclash run events <RUN_ID>
```

`run events` streams `/v1/runs/<runID>/events/stream` via SSE.

- In structured output mode (`--json` or `--output yaml`), `eval start --follow` and `run create --follow` print the created run and do not stream events. Use `agentclash run events <RUN_ID> --json` or `--output yaml` for structured event streams.
- Human output prints timestamped event summaries.
- `--json` prints one NDJSON event payload per line.
- `--output yaml` prints a YAML multi-document stream.
- Press Ctrl+C to stop an event stream.

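Because `--json` emits one NDJSON payload per line, a consumer can process the stream line by line without buffering the whole output. The sketch below only counts non-empty lines, since this skill does not document the event fields. `count_ndjson_events` is a hypothetical helper.

```bash
# Hypothetical helper: read NDJSON from stdin and print how many events arrived.
count_ndjson_events() {
  local count=0 line
  # The `|| [ -n "$line" ]` clause keeps a final line that lacks a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue   # skip blank lines
    count=$((count + 1))
  done
  echo "$count"
}
```

It could be fed from a saved stream, for example `count_ndjson_events < events.ndjson`, where `events.ndjson` is a file the script wrote earlier.
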
## Inspect After Creation
Use these read commands after a run is created:

```bash
agentclash run list --json
agentclash run get <RUN_ID> --json
agentclash run agents <RUN_ID> --json
agentclash run ranking <RUN_ID> --json
agentclash run ranking <RUN_ID> --sort-by composite
agentclash run failures <RUN_ID> --json
agentclash eval scorecard <RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run scorecard <RUN_AGENT_ID> --json
```

Read command notes:

- `run list` lists runs in the workspace.
- `run get` reads `/v1/runs/<id>`.
- `run agents` lists run agents and labels.
- `run ranking --sort-by` accepts `composite`, `correctness`, `reliability`, `latency`, or `cost`.
- `run failures` accepts `--agent`, `--severity`, `--class`, `--evidence-tier`, `--cluster`, `--cursor`, and `--limit`.
- `run scorecard` expects a run agent ID.
- `eval scorecard [run]` is run-first. If run is omitted, it selects the latest workspace run; with multiple run agents, pass `--agent` in non-interactive mode.
- `eval scorecard --json` returns an envelope with `candidate`, `baseline`, `scorecard`, `comparison`, and `release_gate`.
- If scorecard generation is pending, stateful scorecard reads can return a pending payload instead of a final scorecard.

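When the sort key comes from user input or a CI variable, it can be checked against the accepted values before calling `run ranking`. A minimal sketch, with `is_valid_sort_key` as a hypothetical helper name:

```bash
# Hypothetical helper: succeeds only for the sort keys `run ranking --sort-by` accepts.
is_valid_sort_key() {
  case "$1" in
    composite|correctness|reliability|latency|cost) return 0 ;;
    *) return 1 ;;
  esac
}
```
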
## Common Failure Modes
- No workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- No challenge packs: publish a pack first with `agentclash-challenge-pack-validation-publish`.
- Multiple packs in non-interactive `eval start`: pass `--pack` or a version ID through `--pack-version`.
- Version number without pack: pass `--pack` as well, because a bare version number cannot identify a pack.
- Multiple input sets: pass `--input-set`; `eval start` can use ID/key/exact name, while `run create` expects ID.
- Multiple deployments: pass one or more `--deployment` flags for `eval start`, or `--deployments` IDs for `run create`.
- `missing_challenge_input_set_id`: the selected pack version has multiple input sets and no input set ID was submitted.
- `invalid_agent_deployment_ids`: deployment IDs must be active deployments with snapshots in the selected workspace, with no duplicates.
- `invalid_challenge_pack_version_id`: the version must be runnable and visible to the selected workspace.
- `invalid_challenge_input_set_id`: the input set must belong to the selected challenge pack version.
- `invalid_race_context`: race context requires at least two agents.
- `--race-context-cadence must be 0 (backend default) or between 1 and 10`: fix the cadence value.
- `--follow is not supported with --repetitions >= 2`: create the eval session, then use `eval session follow` or stream individual child runs with `run events`.
- `--scope suite_only requires at least one --suite or --case`: add a suite or case selection.
- Scorecard pending or errored: report the state, then collect `run events`, `run agents`, and `run failures`.

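For automation, the API error codes above can be mapped to a short remediation hint. The sketch assumes the script has already extracted the error code from the CLI or API output by some means not shown here. `hint_for_error` is a hypothetical helper, and its hint text paraphrases this list.

```bash
# Hypothetical helper: print a remediation hint for a known error code.
hint_for_error() {
  case "$1" in
    missing_challenge_input_set_id)
      echo "pass --input-set; the pack version has multiple input sets" ;;
    invalid_agent_deployment_ids)
      echo "use active deployments with snapshots in the selected workspace, without duplicates" ;;
    invalid_challenge_pack_version_id)
      echo "use a runnable version that is visible to the selected workspace" ;;
    invalid_challenge_input_set_id)
      echo "use an input set that belongs to the selected challenge pack version" ;;
    invalid_race_context)
      echo "race context requires at least two agents" ;;
    *)
      echo "unrecognized error code: $1" >&2
      return 1 ;;
  esac
}
```
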
## Safety Notes
- Creating runs can spend provider budget and may execute tools or network access allowed by the deployment/runtime profile.
- Confirm before running production-scale, multi-deployment, high-repetition, or network-enabled evals.
- Prefer small input sets and `--scope suite_only` for smoke checks.
- Do not paste secrets from run events, scorecards, failures, artifacts, or logs into chat.
- Use `--json` for automation and save run IDs before starting follow streams.

## Report Back Format
```text
Workspace: <workspace-id>
Command used:
Run ID: <id or none>
Eval session ID: <id or none>
Child run IDs: <ids if repetitions >= 2>
Challenge pack version: <id>
Input set: <id/key/name or none>
Deployments:
- <id/name>
Scope: <full|suite_only>
Followed: <yes/no>
Status: <queued/running/completed/failed/etc>
Agents: <count and labels>
Ranking: <summary or unavailable>
Failures: <count/filter summary or unavailable>
Scorecard: <state/link/summary or unavailable>
Evidence commands:
- agentclash eval session get <EVAL_SESSION_ID> --json
- agentclash run series report <EVAL_SESSION_ID>
- agentclash run get <RUN_ID> --json
- agentclash run agents <RUN_ID> --json
- agentclash run ranking <RUN_ID> --json
- agentclash run failures <RUN_ID> --json
Next action: <recommendation>
```

## Related Skills
- `agentclash-hub`
- `agentclash-cli-setup`
- `agentclash-quickstart`
- `agentclash-agent-deployment-setup`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-scorecard-reader`
- `agentclash-compare-and-triage`
- `agentclash-multi-turn-operator`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/getting-started/first-eval`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/reference/cli`