Challenge Pack Input Sets Skill
Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets/SKILL.md
Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets
Full SKILL.md
---
name: agentclash-challenge-pack-input-sets
description: Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
metadata:
  agentclash.role: challenge-pack-inputs
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Input Sets

## Purpose
Design `input_sets` and cases that are valid, repeatable, and useful for runs, regression promotion, and CI gates.

Use this skill after the challenge pack structure is known and before scoring is finalized. The goal is to make every case observable: each case should have a stable key, a declared challenge, concrete inputs, expected evidence, and a clear reason to exist.

## Use When
- A challenge pack needs smoke, full, regression, edge-case, or CI-oriented case subsets.
- Cases exist but are poorly named, duplicated, too broad, or hard to score.
- A coding agent needs exact `input_sets[].cases[]` YAML shape without reading the AgentClash source repo.
- You need to decide which cases are safe for fast checks versus full benchmark runs.

## Do Not Use When
- The pack idea is still vague; use `agentclash-challenge-pack-planner`.
- The user needs the whole YAML file written; use `agentclash-challenge-pack-yaml-author`.
- The task is to configure validators, judges, tools, sandbox, artifacts, validation, publish, or run creation; use the focused downstream skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## Validation Commands
Validate the pack after editing input sets.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.
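
For scripted checks, the `--json` output can be read programmatically. The following is a minimal sketch, not the CLI's own tooling: only the `valid` and `errors` field names come from the text above, the shape of each `errors` entry is not specified (it is treated as an opaque value here), and the sample payload is hypothetical rather than captured from a real run.

```python
import json


def summarize_validation(json_text):
    """Return (valid, errors) parsed from `challenge-pack validate --json` output.

    Assumption: the payload is a JSON object with `valid` and `errors` fields.
    Entries in `errors` are kept as-is because their shape is not documented here.
    """
    report = json.loads(json_text)
    valid = bool(report.get("valid", False))
    errors = list(report.get("errors") or [])
    return valid, errors


# Hypothetical sample payload for illustration only.
sample = '{"valid": false, "errors": ["duplicate case_key in refund-smoke"]}'

valid, errors = summarize_validation(sample)
print("valid:", valid)
for err in errors:
    print("-", err)
```

In practice `json_text` would be the captured stdout of the `--json` command; treating a missing `valid` field as a failure keeps the check conservative.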

## Exact YAML Shape
The current bundle parser accepts top-level `input_sets`, each with `key`, `name`, optional `description`, and `cases`.

```yaml
input_sets:
  - key: refund-smoke
    name: Refund Smoke
    description: Fast refund-policy checks for CI and authoring smoke tests.
    cases:
      - challenge_key: refund-question
        case_key: refund-window-basic
        payload:
          customer_message: Can I get a refund after 14 days?
          account_tier: standard
        inputs:
          - key: prompt
            kind: text
            value: Can I get a refund after 14 days?
        expectations:
          - key: policy_reference
            kind: text
            source: input:prompt
```

Case fields:

- `challenge_key`: required, must reference a declared `challenges[].key`.
- `case_key`: required for new YAML, stable, and unique per challenge within the input set.
- `payload`: optional structured data for the case, such as text, JSON-like fields, IDs, or fixture metadata.
- `inputs`: optional list of concrete case inputs.
- `expectations`: optional list of expected evidence or references.
- `artifacts`: optional list of case artifact refs that must reference declared version assets.
- `assets`: optional case-local assets, each with unique `key` and required `path`.

Use `cases`, not legacy `items`, in new YAML. Legacy `items` are normalized by the parser, but new skills and packs should not author them.

## Hard Validator Rules
- Every input set needs `key`, `name`, and at least one case.
- `input_sets[].key` values must be unique.
- Every case needs `challenge_key` and `case_key`.
- Every `challenge_key` must reference a declared challenge.
- All cases inside the same input set must reference the same `challenge_key`.
- A `case_key` must be unique per challenge within that input set.
- Every case input needs unique `key` and required `kind`.
- Every case expectation needs unique `key` and required `kind`.
- `artifact_key` on inputs or expectations must reference a declared version asset.
- Expectation `source` must be empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`, for example `source: input:prompt`.
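
To make the structural rules above concrete, here is a small pre-check sketch over input sets that have already been loaded into plain Python dictionaries. It is not the AgentClash validator and covers only the set-level and case-key rules; the function name `check_input_sets` and the message wording are illustrative assumptions. The CLI remains the source of truth.

```python
def check_input_sets(input_sets, declared_challenge_keys):
    """Return a list of problems; an empty list means none of the checked rules failed."""
    problems = []
    seen_set_keys = set()
    for index, input_set in enumerate(input_sets):
        set_key = input_set.get("key")
        label = set_key or f"input_sets[{index}]"
        # Rule: every input set needs key, name, and at least one case.
        if not set_key or not input_set.get("name"):
            problems.append(f"{label}: needs key and name")
        # Rule: input set keys must be unique.
        if set_key and set_key in seen_set_keys:
            problems.append(f"{label}: duplicate input set key")
        if set_key:
            seen_set_keys.add(set_key)

        cases = input_set.get("cases") or []
        if not cases:
            problems.append(f"{label}: needs at least one case")

        challenge_keys = set()
        seen_case_keys = set()
        for case in cases:
            challenge_key = case.get("challenge_key")
            case_key = case.get("case_key")
            if not challenge_key or not case_key:
                problems.append(f"{label}: case needs challenge_key and case_key")
                continue
            # Rule: challenge_key must reference a declared challenge.
            if challenge_key not in declared_challenge_keys:
                problems.append(f"{label}/{case_key}: undeclared challenge_key {challenge_key!r}")
            challenge_keys.add(challenge_key)
            # Rule: case_key is unique per challenge within the input set.
            if (challenge_key, case_key) in seen_case_keys:
                problems.append(f"{label}/{case_key}: duplicate case_key")
            seen_case_keys.add((challenge_key, case_key))
        # Rule: one input set cannot mix challenge keys.
        if len(challenge_keys) > 1:
            problems.append(f"{label}: mixes challenge keys {sorted(challenge_keys)}")
    return problems


# Example: a single set that mixes two challenges.
pack_sets = [
    {
        "key": "refund-smoke",
        "name": "Refund Smoke",
        "cases": [
            {"challenge_key": "refund-question", "case_key": "refund-window-basic"},
            {"challenge_key": "billing-question", "case_key": "invoice-copy-basic"},
        ],
    }
]
for problem in check_input_sets(pack_sets, {"refund-question", "billing-question"}):
    print(problem)
```

The example set is reported because it mixes two challenge keys, which is the situation the per-challenge split is meant to avoid.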

Because an input set cannot mix challenge keys today, use separate input sets per challenge when designing pack-wide smoke or CI coverage:

```yaml
input_sets:
  - key: refund-smoke
    name: Refund Smoke
    cases:
      - challenge_key: refund-question
        case_key: refund-window-basic
  - key: billing-smoke
    name: Billing Smoke
    cases:
      - challenge_key: billing-question
        case_key: invoice-copy-basic
```

## Input And Expectation Patterns
Use `payload` for case data the evaluator or prompt builder should understand as structured context. Use `inputs` when individual evidence keys need to be referenced by expectations, validators, judges, or review output.

```yaml
cases:
  - challenge_key: summarize-policy
    case_key: policy-summary-edge-exclusions
    payload:
      audience: customer-support-agent
      risk: exclusion missed
    inputs:
      - key: prompt
        kind: text
        value: Summarize the policy and call out exclusions.
      - key: source_doc
        kind: file
        artifact_key: policy_pdf
        path: assets/policy.pdf
    expectations:
      - key: required_topics
        kind: json
        value:
          must_include:
            - refund window
            - exclusions
      - key: prompt_reference
        kind: text
        source: input:prompt
```

Use `value` when the expected content is inline. Use `source: input:<key>` when the expectation should refer to an input in the same case. Use `source: artifact:<key>` only for version-level assets.
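
The three `source` forms can be checked mechanically before running the CLI. The sketch below assumes a case has been loaded as a plain dictionary and that the set of declared version asset keys is known; `check_expectation_sources` is an illustrative name, not an AgentClash API.

```python
def check_expectation_sources(case, version_asset_keys):
    """Return problems for expectation `source` values in a single case."""
    input_keys = {item.get("key") for item in case.get("inputs") or []}
    problems = []
    for expectation in case.get("expectations") or []:
        name = expectation.get("key")
        source = expectation.get("source") or ""
        if not source:
            continue  # empty source is allowed, e.g. when an inline `value` is used
        prefix, sep, ref = source.partition(":")
        if prefix == "input" and sep and ref:
            # input:<key> must point at an input in the same case
            if ref not in input_keys:
                problems.append(f"{name}: source {source!r} does not match an input key in this case")
        elif prefix == "artifact" and sep and ref:
            # artifact:<key> must point at a declared version asset
            if ref not in version_asset_keys:
                problems.append(f"{name}: source {source!r} does not match a declared version asset")
        else:
            problems.append(f"{name}: unsupported source {source!r}")
    return problems


# Example based on the case above, plus one deliberately broken reference.
example_case = {
    "inputs": [
        {"key": "prompt", "kind": "text"},
        {"key": "source_doc", "kind": "file", "artifact_key": "policy_pdf"},
    ],
    "expectations": [
        {"key": "required_topics", "kind": "json", "value": {"must_include": ["exclusions"]}},
        {"key": "prompt_reference", "kind": "text", "source": "input:prompt"},
        {"key": "missing_ref", "kind": "text", "source": "input:transcript"},
    ],
}
for problem in check_expectation_sources(example_case, {"policy_pdf"}):
    print(problem)
```

Only the hypothetical `missing_ref` expectation is reported, because no input named `transcript` exists in the same case.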

## Input Set Types
Use names that describe run intent and challenge scope.

| Input set | Purpose | Typical size | Guidance |
| --- | --- | --- | --- |
| `<challenge>-smoke` | Fast sanity check | 1-3 cases | Covers the most ordinary success path and one cheap edge case. |
| `<challenge>-ci` | CI gate input set | 1-5 cases | Deterministic, stable, low-cost, no flaky external dependency. |
| `<challenge>-full` | Benchmark coverage | 5+ cases | Representative distribution across easy, medium, hard, and expert cases. |
| `<challenge>-regression` | Known failure replay | As needed | Minimal reproductions of real failures with stable expectations. |
| `<challenge>-edge` | Boundary behavior | Focused | Valid unusual inputs, ambiguity, malformed-but-recoverable payloads, or safety guardrails. |

Do not put unrelated capabilities into one input set. If a run needs multiple challenges, model that through multiple challenge-specific input sets or downstream run/eval selection, not a mixed `challenge_key` input set.
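
The typical sizes in the table can be turned into a soft authoring hint. This sketch assumes input set keys follow the `<challenge>-<intent>` naming shown above; the sizes are guidance rather than validator rules, so the helper (a hypothetical `size_hint`) returns advice instead of errors.

```python
# Typical case counts from the table; None means no upper bound is suggested.
TYPICAL_SIZES = {"smoke": (1, 3), "ci": (1, 5), "full": (5, None)}


def size_hint(input_set_key, case_count):
    """Return a hint string when a set falls outside its typical size, else None."""
    intent = input_set_key.rsplit("-", 1)[-1]
    if intent not in TYPICAL_SIZES:
        return None  # regression and edge sets have no fixed size guidance
    low, high = TYPICAL_SIZES[intent]
    if case_count < low:
        return f"{input_set_key}: {case_count} cases, typical minimum is {low}"
    if high is not None and case_count > high:
        return f"{input_set_key}: {case_count} cases, typical maximum is {high}"
    return None


for key, count in [("refund-smoke", 7), ("refund-full", 3), ("refund-ci", 4)]:
    hint = size_hint(key, count)
    if hint:
        print(hint)
```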

## Coverage Review
For each challenge, check:

- Happy path: ordinary user request or fixture.
- Edge path: unusual but valid input.
- Negative path: refusal, abstention, rejection, or safe fallback when appropriate.
- Ambiguous path: should ask for clarification or make a defensible assumption.
- Regression path: known previous failure with the smallest reproducible case.
- Budget path: confirms the case can run within intended time, tool, and cost limits.

Each case should have a reason. If two cases would fail for the same reason and exercise the same evidence, keep the clearer one unless you need variance.
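
One way to keep this review honest is to track which paths each case is meant to exercise. The sketch below assumes the author keeps a mapping from `case_key` to path labels as authoring notes; these labels are hypothetical and are not a field in the pack YAML. A reported gap is a prompt to review, not necessarily an error, since some paths apply only when appropriate.

```python
COVERAGE_PATHS = ("happy", "edge", "negative", "ambiguous", "regression", "budget")


def coverage_gaps(case_paths):
    """Return the coverage paths that no case claims, in checklist order.

    `case_paths` maps case_key -> iterable of path labels (author notes only).
    """
    covered = set()
    for paths in case_paths.values():
        covered.update(paths)
    return [path for path in COVERAGE_PATHS if path not in covered]


notes = {
    "refund-window-basic": {"happy", "budget"},
    "refund-window-expired": {"edge", "negative"},
}
print(coverage_gaps(notes))
```

With these notes, the ambiguous and regression paths are the ones left to review.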

## Stable Key Rules
- Use lowercase kebab-case keys: `refund-window-basic`, `invoice-missing-id`, `policy-summary-edge-exclusions`.
- Do not include dates, random IDs, or run IDs unless they are part of the scenario being tested.
- Keep `case_key` stable after publish; downstream results, regressions, and reports become easier to compare.
- Prefer descriptive input keys such as `prompt`, `source_doc`, `expected_schema`, or `customer_record`.
- Keep fixture IDs inside `payload`, not in the `case_key`, unless the fixture identity is the scenario.
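
A quick lint for `case_key` values might look like the sketch below. The kebab-case pattern follows the examples above; the date and UUID patterns are heuristics chosen for illustration, and a match may be legitimate when the date or ID is part of the scenario. Input keys are left out on purpose, since the examples in this document use underscores for them (`source_doc`).

```python
import re

KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Heuristic patterns (assumptions): ISO-style dates and UUID-shaped IDs.
DATE_LIKE = re.compile(r"\d{4}-\d{2}-\d{2}")
UUID_LIKE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def lint_case_key(case_key):
    """Return a list of warnings for a case_key; an empty list means it looks stable."""
    warnings = []
    if not KEBAB_CASE.fullmatch(case_key):
        warnings.append("not lowercase kebab-case")
    if DATE_LIKE.search(case_key):
        warnings.append("contains a date-like segment")
    if UUID_LIKE.search(case_key):
        warnings.append("contains a UUID-like segment")
    return warnings


for key in ["refund-window-basic", "Refund_Window", "refund-2024-01-15"]:
    print(key, lint_case_key(key))
```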

## Regression And CI Guidance
Regression input sets should be small and forensic: they preserve the evidence needed to reproduce a known failure. Use them to seed regression suites later, but do not confuse pack `input_sets` with regression suites.

CI-oriented input sets should be deterministic. Avoid:
- live third-party data that changes without fixture control
- broad network dependency
- subjective-only expectations with no stable evidence
- large file sets when a smaller fixture proves the behavior

The eval runner can select a published input set with:

```bash
agentclash eval start --input-set <INPUT_SET_ID_OR_KEY_OR_EXACT_NAME>
```

`--scope suite_only` is for regression suite/case selection, not a replacement for `input_sets`.

## Common Validation Failures
- One input set mixes `challenge_key` values.
- The case has neither `case_key` nor legacy `item_key`.
- Duplicate `input_sets[].key`, duplicate case keys, duplicate input keys, or duplicate expectation keys.
- `challenge_key` points at a title instead of the declared challenge `key`.
- `source: input:...` references an input key that does not exist in the same case.
- `source: artifact:...` or `artifact_key` references an undeclared version asset.
- Case-local assets omit `path`.
- The case has expectations that are impossible to observe in final output, files, artifacts, or judge context.

## Authoring Procedure
1. List challenges and confirm their exact `key` values.
2. Draft cases per challenge, not across challenges.
3. Split by run intent: smoke, CI, full, regression, and edge.
4. Give every case a stable `case_key`, concrete `payload`, and clear reason.
5. Add `inputs` when expectations or scoring need named evidence.
6. Add `expectations` that reference inline `value`, `input:<key>`, or `artifact:<key>` only when those references exist.
7. Review duplicates and remove cases with no unique signal.
8. Validate the pack with `agentclash challenge-pack validate ... --json`.
9. Hand off to scoring, validation/publish, or eval runner skills.

## Report Back Format
```text
Challenge:
Input sets:
- key:
  purpose:
  case count:
  intended use: <smoke | ci | full | regression | edge>
  cases:
  - case_key:
    payload summary:
    inputs:
    expectations:
    reason:
Coverage gaps:
Validation command:
Ready for scoring: <yes/no>
Next skill:
```

## Related Skills
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-regression-flywheel`