Challenge Pack YAML Author Skill

Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author/SKILL.md
Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author
Full SKILL.md

---
name: agentclash-challenge-pack-yaml-author
description: Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
metadata:
  agentclash.role: challenge-pack-authoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack YAML Author

## Purpose
Write challenge-pack YAML that matches the current AgentClash parser and validators, without requiring access to the AgentClash source repo.

Use this skill after planning. The output should be a concrete YAML file plus the exact validation commands a coding agent should run before publish.

## Use When
- A challenge-pack plan needs to become valid YAML.
- An existing pack YAML needs source-compatible edits.
- A coding agent needs the exact YAML object shape for `pack`, `version`, `challenges`, `input_sets`, scoring, tools, sandbox, assets, cases, inputs, and expectations.
- The agent will later hand the file to `agentclash-challenge-pack-validation-publish`.

## Do Not Use When
- The user only has a vague eval idea; use `agentclash-challenge-pack-planner` first.
- The YAML is finished and only needs validation or publish; use `agentclash-challenge-pack-validation-publish`.
- The task is to start runs or choose deployments; use `agentclash-eval-runner` or deployment skills.

## Environment
Use hosted production unless the user intentionally points at another backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## CLI Commands
Start from the CLI template when possible, then edit the YAML.

```bash
agentclash challenge-pack init support-eval.yaml --template prompt_eval --name "Support Eval" --slug support-eval
agentclash challenge-pack init native-files.yaml --template native --name "Native Files" --slug native-files
agentclash challenge-pack validate support-eval.yaml
agentclash challenge-pack validate support-eval.yaml --json
```

Human validation output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` when a coding agent needs structured fields such as `valid` and `errors`.

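For a coding agent that consumes the `--json` output, the handoff could be sketched in Python roughly as follows. Only the `valid` and `errors` fields are named above; this sketch assumes the JSON report is written to stdout and that `errors` is a list whose entries can be stringified. The helper names `run_validate` and `summarize_validation` are hypothetical and not part of the CLI.

```python
import json
import subprocess


def run_validate(path):
    """Run the documented validate command and return its raw stdout.

    Assumption: the JSON report is printed to stdout. The exit code is
    not described here, so it is deliberately not checked.
    """
    result = subprocess.run(
        ["agentclash", "challenge-pack", "validate", path, "--json"],
        capture_output=True,
        text=True,
    )
    return result.stdout


def summarize_validation(json_text):
    """Return (valid, messages) from a validation report.

    Only `valid` and `errors` are read; the shape of each error entry
    is an assumption, so entries are simply converted to strings.
    """
    report = json.loads(json_text)
    valid = bool(report.get("valid", False))
    messages = [str(entry) for entry in report.get("errors") or []]
    return valid, messages
```

Calling `summarize_validation(run_validate("support-eval.yaml"))` would then yield a pass/fail flag plus messages that can be copied into the report back.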
## YAML Skeleton
These are the top-level fields accepted by the bundle parser:

```yaml
pack:
  slug: support-eval
  name: Support Eval
  family: support
  description: Evaluates concise customer support answers.

version:
  number: 1
  execution_mode: prompt_eval
  evaluation_spec:
    name: Support Eval Scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_policy
        type: contains
        target: final_output
        expected_from: literal:refund policy
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          weight: 1

challenges:
  - key: refund-question
    title: Refund Policy Question
    category: support
    difficulty: easy
    instructions: Answer the customer in a concise, helpful tone.

input_sets:
  - key: smoke
    name: Smoke
    description: Fast validation cases.
    cases:
      - challenge_key: refund-question
        case_key: basic-refund
        payload:
          customer_message: Can I get a refund after 14 days?
        inputs:
          - key: prompt
            kind: text
            value: Can I get a refund after 14 days?
        expectations:
          - key: expected_policy
            kind: text
            source: input:prompt
```

## Required Fields
- `pack.slug`, `pack.name`, and `pack.family` are required.
- `version.number` must be greater than zero.
- `version.execution_mode` should be explicit: use `prompt_eval` or `native`.
- `version.evaluation_spec.validators` must contain at least one validator.
- Every challenge needs `key`, `title`, `category`, and `difficulty`.
- `difficulty` must be `easy`, `medium`, `hard`, or `expert`.
- Every input set needs `key`, `name`, and at least one `cases` entry.
- Every case needs `challenge_key` referencing a declared challenge and a stable `case_key`.
- Use `cases`, not legacy `items`, in new YAML.

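The required-field rules above can be approximated by a local pre-flight check before running the CLI. The following Python sketch assumes the YAML has already been loaded into a plain dict by some YAML loader; `check_required_fields` is a hypothetical helper, not the AgentClash validator, and its messages are illustrative only.

```python
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard", "expert"}


def check_required_fields(doc):
    """Return a list of problems for a pack already parsed into a dict."""
    problems = []
    pack = doc.get("pack") or {}
    for field in ("slug", "name", "family"):
        if not pack.get(field):
            problems.append(f"pack.{field} is required")

    version = doc.get("version") or {}
    number = version.get("number")
    if not isinstance(number, int) or number <= 0:
        problems.append("version.number must be greater than zero")
    if version.get("execution_mode") not in ("prompt_eval", "native"):
        problems.append("version.execution_mode should be prompt_eval or native")
    if not (version.get("evaluation_spec") or {}).get("validators"):
        problems.append("evaluation_spec.validators needs at least one validator")

    challenge_keys = set()
    for challenge in doc.get("challenges") or []:
        key = challenge.get("key")
        for field in ("key", "title", "category", "difficulty"):
            if not challenge.get(field):
                problems.append(f"challenge {key!r} is missing {field}")
        difficulty = challenge.get("difficulty")
        if difficulty and difficulty not in ALLOWED_DIFFICULTIES:
            problems.append(f"challenge {key!r} has invalid difficulty {difficulty!r}")
        if key:
            challenge_keys.add(key)

    for input_set in doc.get("input_sets") or []:
        set_key = input_set.get("key")
        if not set_key or not input_set.get("name"):
            problems.append(f"input set {set_key!r} needs key and name")
        if "items" in input_set:
            problems.append(f"input set {set_key!r} uses legacy items; use cases")
        cases = input_set.get("cases") or []
        if not cases:
            problems.append(f"input set {set_key!r} needs at least one case")
        for case in cases:
            if not case.get("case_key"):
                problems.append(f"a case in {set_key!r} is missing case_key")
            if case.get("challenge_key") not in challenge_keys:
                problems.append(
                    f"case {case.get('case_key')!r} references unknown challenge"
                )
    return problems
```

An empty list only means these local checks passed; `agentclash challenge-pack validate` remains the source of truth.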
## Execution Modes
Use `prompt_eval` for prompt-style tasks where the agent only needs the prompt and final output.

`prompt_eval` cannot use:
- top-level `tools`
- `version.tool_policy`
- `version.sandbox`

Use `native` when the challenge needs files, tools, network policy, package installation, file validators, directory checks, code execution, or sandbox behavior.

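As a small illustration of the `prompt_eval` restriction, the Python sketch below lists any native-only blocks found in a pack that has been parsed into a dict. The function name is hypothetical, and the check mirrors only the three blocks listed above.

```python
def native_only_blocks_in_prompt_eval(doc):
    """Return the native-only blocks present in a prompt_eval pack."""
    version = doc.get("version") or {}
    if version.get("execution_mode") != "prompt_eval":
        return []  # the restriction applies to prompt_eval packs only
    found = []
    if "tools" in doc:
        found.append("tools")
    for block in ("tool_policy", "sandbox"):
        if block in version:
            found.append(f"version.{block}")
    return found
```

A non-empty result suggests either removing those blocks or switching `execution_mode` to `native`.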
## Native Tools And Sandbox
Only include these blocks for `native` packs.

```yaml
tools:
  custom:
    - name: lookup_order
      description: Looks up an order by ID.
      parameters:
        type: object
        properties:
          order_id:
            type: string
        required:
          - order_id
      implementation:
        primitive: http_request
        args:
          method: GET
          url: "https://example.test/orders/${order_id}"
          headers:
            Authorization: "Bearer ${secrets.ORDER_API_KEY}"

version:
  number: 1
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - browser
      - file
      - network
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24
    env_vars:
      DATASET_MODE: fixture
    additional_packages:
      - jq
```

Supported `version.tool_policy.allowed_tool_kinds` values are exactly `browser`, `build`, `data`, `file`, and `network`. Do not use `shell` as an allowed tool kind.

For `tools.custom`, each tool needs `name`, `parameters`, and `implementation`. Non-`mock` implementations need `implementation.primitive` and `implementation.args`. Template placeholders in `args` use `${parameter_name}` for declared parameters and may reference `${secrets.SECRET_KEY}` when the runtime provides that secret; never paste raw secret values into YAML.

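The tool-kind and placeholder rules above lend themselves to a quick local check. In this Python sketch, declared parameter names are assumed to be the keys under `parameters.properties`, as in the example tool; how a `mock` implementation is marked is not shown here, so that rule is left out. All helper names are hypothetical.

```python
import re

ALLOWED_TOOL_KINDS = {"browser", "build", "data", "file", "network"}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def iter_strings(value):
    """Yield every string found inside nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def check_custom_tool(tool):
    """Return problems for one entry of `tools.custom`."""
    problems = []
    for field in ("name", "parameters", "implementation"):
        if not tool.get(field):
            problems.append(f"tool is missing {field}")
    # Assumption: parameter names live under parameters.properties.
    declared = set((tool.get("parameters") or {}).get("properties") or {})
    args = (tool.get("implementation") or {}).get("args") or {}
    for text in iter_strings(args):
        for name in PLACEHOLDER.findall(text):
            if name not in declared and not name.startswith("secrets."):
                problems.append(f"placeholder ${{{name}}} is not a declared parameter")
    return problems


def unsupported_tool_kinds(kinds):
    """Return the kinds outside the supported set, in input order."""
    return [kind for kind in kinds if kind not in ALLOWED_TOOL_KINDS]
```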
Sandbox rules:
- `network_allowlist` entries must be valid CIDR ranges.
- `env_vars` keys must look like shell env names, for example `DATASET_MODE`.
- `additional_packages` entries must be valid apt-style package names.
- Never put raw secrets in YAML. Name required secret keys in notes and configure them through the workspace/runtime/provider flow.

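A rough local approximation of the sandbox rules is sketched below in Python. The exact patterns the AgentClash validator applies are not given here, so the environment-name and package-name regular expressions are assumptions (the latter follows Debian-style naming), as is the choice to require an explicit prefix length for CIDR entries.

```python
import ipaddress
import re

# Assumption: letters, digits, and underscores, not starting with a digit.
ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Assumption: Debian-style names (lowercase letters, digits, '+', '-', '.',
# at least two characters, starting with a letter or digit).
APT_NAME = re.compile(r"[a-z0-9][a-z0-9+.-]+")


def check_sandbox(sandbox):
    """Return problems for a `version.sandbox` block parsed into a dict."""
    problems = []
    for entry in sandbox.get("network_allowlist") or []:
        try:
            if "/" not in str(entry):
                raise ValueError("missing prefix length")
            ipaddress.ip_network(str(entry), strict=False)
        except ValueError:
            problems.append(f"network_allowlist entry {entry!r} is not a CIDR range")
    for name in sandbox.get("env_vars") or {}:
        if not ENV_NAME.fullmatch(str(name)):
            problems.append(f"env_vars key {name!r} is not a shell-style name")
    for package in sandbox.get("additional_packages") or []:
        if not APT_NAME.fullmatch(str(package)):
            problems.append(f"additional_packages entry {package!r} is not a valid name")
    return problems
```

The sandbox block from the example above passes this check; a hostname in `network_allowlist`, a key such as `1BAD`, or a package name containing spaces would each be reported.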
## Assets, Inputs, Expectations, And Artifacts
Assets may appear on `version`, challenges, or cases. Each asset list must use unique `key` values, and every asset needs `path`.

```yaml
version:
  assets:
    - key: policy_pdf
      kind: file
      path: assets/policy.pdf
      media_type: application/pdf

challenges:
  - key: summarize-policy
    title: Summarize Policy
    category: documents
    difficulty: medium
    artifact_refs:
      - key: policy_pdf

input_sets:
  - key: full
    name: Full
    cases:
      - challenge_key: summarize-policy
        case_key: policy-summary
        inputs:
          - key: source_doc
            kind: file
            artifact_key: policy_pdf
            path: assets/policy.pdf
        expectations:
          - key: summary_requirements
            kind: text
            value: Mention refund window and exclusions.
```

Case input fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `path`.

Case expectation fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `source`. Supported `source` values are empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`, for example `source: input:prompt`.

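The three accepted `source` forms can be checked with a few lines of Python. This sketch additionally assumes that the key after the prefix must name a declared case input or version asset; the function name and messages are hypothetical.

```python
def check_expectation_source(source, input_keys, asset_keys):
    """Return None when `source` is acceptable, else a short reason."""
    if not source:
        return None  # an empty source is allowed
    prefix, _, key = source.partition(":")
    if prefix == "input":
        return None if key in input_keys else f"unknown case input {key!r}"
    if prefix == "artifact":
        return None if key in asset_keys else f"unknown version asset {key!r}"
    return f"unsupported source {source!r}"
```

For the skeleton case, `check_expectation_source("input:prompt", {"prompt"}, set())` returns `None`, while a value such as `file:policy.pdf` is reported as unsupported.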
## Evaluation Spec
`evaluation_spec` controls scoring. Keep deterministic checks deterministic, and use LLM judges only when subjective quality is genuinely needed.

```yaml
evaluation_spec:
  name: Support Eval Scoring
  version_number: 1
  judge_mode: hybrid
  validators:
    - key: has_json
      type: json_schema
      target: final_output
      expected_from: 'literal:{"type":"object","required":["answer"]}'
  llm_judges:
    - key: helpfulness
      mode: rubric
      model: gpt-4.1
      rubric: Judge whether the answer is helpful, grounded, and concise.
      context_from:
        - challenge_input
        - final_output
  scorecard:
    strategy: hybrid
    dimensions:
      - key: schema_gate
        source: validators
        validators:
          - has_json
        weight: 1
        gate: true
        pass_threshold: 1
      - key: helpfulness
        source: llm_judge
        judge_key: helpfulness
        weight: 1
```

Supported `judge_mode` values are `deterministic`, `llm_judge`, and `hybrid`.

Supported validator types include `exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`, `tool_call_assertion`, and `postcondition`.

Evidence references accepted by validators and judges include `final_output`, `run.final_output`, `challenge_input`, `case.payload`, `case.payload.<path>`, `case.inputs.<path>`, `case.expectations.<path>`, `artifact.<path>`, `file:<post_execution_check_key>`, and `literal:<value>`.

File validators require a `file:` target. `code_execution` validators must target a `post_execution_checks` entry of type `file_capture`. `tool_call_assertion` validators must target `tool_calls` and omit `expected_from`. `postcondition` validators target `file:<post_execution_check_key>`, omit `expected_from`, and use `config.condition` for declarative post-run checks.

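Some of the target rules above can be checked per validator, as in this Python sketch. Which types count as "file validators" is an assumption here (the four `file_*` and `directory_structure` types), and the `code_execution` rule is left out because it depends on the shape of `post_execution_checks`, which is not shown in this skill. The function name is hypothetical.

```python
# Assumption: these four types are the "file validators" referred to above.
FILE_VALIDATOR_TYPES = {
    "file_content_match",
    "file_exists",
    "file_json_schema",
    "directory_structure",
}


def check_validator_target(validator):
    """Return problems for one validator's target/expected_from pairing."""
    problems = []
    kind = validator.get("type")
    target = validator.get("target") or ""
    has_expected = "expected_from" in validator
    if kind in FILE_VALIDATOR_TYPES and not target.startswith("file:"):
        problems.append("file validators require a file: target")
    if kind == "tool_call_assertion":
        if target != "tool_calls":
            problems.append("tool_call_assertion must target tool_calls")
        if has_expected:
            problems.append("tool_call_assertion must omit expected_from")
    if kind == "postcondition":
        if not target.startswith("file:"):
            problems.append("postcondition must target file:<post_execution_check_key>")
        if has_expected:
            problems.append("postcondition must omit expected_from")
        if not (validator.get("config") or {}).get("condition"):
            problems.append("postcondition needs config.condition")
    return problems
```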
For `judge_mode: deterministic`, omit `llm_judges`. For `judge_mode: llm_judge`, include at least one judge. For `judge_mode: hybrid`, include validators and at least one judge; hybrid scorecards need a gated dimension.

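The `judge_mode` consistency rules can be expressed as a short Python check over an `evaluation_spec` parsed into a dict. This is a sketch with a hypothetical function name; it treats a dimension with a truthy `gate` field as the gated dimension, following the hybrid example above.

```python
def check_judge_mode(spec):
    """Return problems where judge_mode disagrees with the rest of the spec."""
    problems = []
    mode = spec.get("judge_mode")
    judges = spec.get("llm_judges") or []
    validators = spec.get("validators") or []
    dimensions = (spec.get("scorecard") or {}).get("dimensions") or []
    if mode not in ("deterministic", "llm_judge", "hybrid"):
        problems.append(f"unsupported judge_mode {mode!r}")
    elif mode == "deterministic" and judges:
        problems.append("deterministic mode should omit llm_judges")
    elif mode == "llm_judge" and not judges:
        problems.append("llm_judge mode needs at least one judge")
    elif mode == "hybrid":
        if not validators or not judges:
            problems.append("hybrid mode needs validators and at least one judge")
        if not any(dimension.get("gate") for dimension in dimensions):
            problems.append("hybrid scorecards need a gated dimension")
    return problems
```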
For scorecard dimensions with `source: validators`, omit the `validators` list only when the dimension should score every validator. Add `validators: [<validator_key>]` when the dimension should cover a specific subset.

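To make the subset rule concrete, this Python sketch resolves which validator keys a `source: validators` dimension would cover. Treating an undeclared key in the subset as an error is an assumption of the sketch, and the function name is hypothetical.

```python
def validators_for_dimension(dimension, validator_keys):
    """Resolve which validator keys a `source: validators` dimension scores.

    Raises ValueError when the subset names a validator that is not
    declared (an assumption of this sketch).
    """
    subset = dimension.get("validators")
    if not subset:
        return list(validator_keys)  # no list: score every validator
    unknown = [key for key in subset if key not in validator_keys]
    if unknown:
        raise ValueError(f"unknown validator keys: {unknown}")
    return list(subset)
```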
Do not put `${secrets.*}` references in LLM judge `rubric`, `assertion`, or `prompt` text. Secrets are allowed in native tool implementation args when the runtime provides them, but judge prompt text rejects secret references.

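A simple scan for secret references in judge text could look like the Python sketch below. It checks only the three fields named above and uses a plain substring match on `${secrets.`; the function name is hypothetical.

```python
def judges_with_secret_refs(llm_judges):
    """Return keys of judges whose prompt-like text mentions ${secrets.*}."""
    flagged = []
    for judge in llm_judges:
        for field in ("rubric", "assertion", "prompt"):
            if "${secrets." in str(judge.get(field) or ""):
                flagged.append(judge.get("key"))
                break  # one hit per judge is enough
    return flagged
```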
## Authoring Procedure
1. Start with `agentclash challenge-pack init ... --template prompt_eval` or `--template native`.
2. Fill `pack` metadata with stable slug/name/family.
3. Set `version.number: 1` for a new pack and choose `execution_mode`.
4. Write challenges before cases so every `case.challenge_key` can reference a real challenge.
5. Add input sets by run purpose: `smoke`, `ci`, `regression`, `full`, or similar.
6. Add deterministic validators first; add LLM judges only when deterministic evidence cannot capture quality.
7. Add native-only tools, sandbox, files, assets, and artifact refs only when the execution mode is `native`.
8. Run validation with and without `--json`.
9. Hand off to publication only after validation passes.

## Common Validation Failures
- Missing `pack.family`, `challenge.category`, or `case_key`.
- Case `challenge_key` does not match any challenge `key`.
- `difficulty` is not one of `easy`, `medium`, `hard`, or `expert`.
- A `prompt_eval` pack includes `tools`, `tool_policy`, or `sandbox`.
- `allowed_tool_kinds` contains unsupported values such as `shell`.
- Asset or artifact reference keys are missing, duplicated, or point at undeclared version assets.
- Case expectation `source` is not empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.
- File validators do not use a `file:` evidence target.
- `judge_mode` conflicts with the presence or absence of `llm_judges`.

## Report Back Format
```text
YAML file:
Execution mode:
Challenges:
Input sets:
Scoring mode:
Native tools/sandbox/assets:
Validation command:
Validation result:
Ready for publish: <yes/no>
Next skill: agentclash-challenge-pack-validation-publish
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-validation-publish`