Challenge Pack YAML Author Skill
Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author/SKILL.md
Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author
Full SKILL.md
---
name: agentclash-challenge-pack-yaml-author
description: Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
metadata:
  agentclash.role: challenge-pack-authoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack YAML Author

## Purpose
Write challenge-pack YAML that matches the current AgentClash parser and validators, without requiring access to the AgentClash source repo.

Use this skill after planning. The output should be a concrete YAML file plus the exact validation commands a coding agent should run before publish.

## Use When
- A challenge-pack plan needs to become valid YAML.
- An existing pack YAML needs source-compatible edits.
- A coding agent needs the exact YAML object shape for `pack`, `version`, `challenges`, `input_sets`, scoring, tools, sandbox, assets, cases, inputs, and expectations.
- The agent will later hand the file to `agentclash-challenge-pack-validation-publish`.

## Do Not Use When
- The user only has a vague eval idea; use `agentclash-challenge-pack-planner` first.
- The YAML is finished and only needs validation or publish; use `agentclash-challenge-pack-validation-publish`.
- The task is to start runs or choose deployments; use `agentclash-eval-runner` or deployment skills.

## Environment
Use hosted production unless the user intentionally points at another backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## CLI Commands
Start from the CLI template when possible, then edit the YAML.

```bash
agentclash challenge-pack init support-eval.yaml --template prompt_eval --name "Support Eval" --slug support-eval
agentclash challenge-pack init native-files.yaml --template native --name "Native Files" --slug native-files
agentclash challenge-pack validate support-eval.yaml
agentclash challenge-pack validate support-eval.yaml --json
```

Human validation output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` when a coding agent needs structured fields such as `valid` and `errors`.

## YAML Skeleton
These are the top-level fields accepted by the bundle parser:

```yaml
pack:
  slug: support-eval
  name: Support Eval
  family: support
  description: Evaluates concise customer support answers.

version:
  number: 1
  execution_mode: prompt_eval
  evaluation_spec:
    name: Support Eval Scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_policy
        type: contains
        target: final_output
        expected_from: literal:refund policy
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          weight: 1

challenges:
  - key: refund-question
    title: Refund Policy Question
    category: support
    difficulty: easy
    instructions: Answer the customer in a concise, helpful tone.

input_sets:
  - key: smoke
    name: Smoke
    description: Fast validation cases.
    cases:
      - challenge_key: refund-question
        case_key: basic-refund
        payload:
          customer_message: Can I get a refund after 14 days?
        inputs:
          - key: prompt
            kind: text
            value: Can I get a refund after 14 days?
        expectations:
          - key: expected_policy
            kind: text
            source: input:prompt
```

## Required Fields
- `pack.slug`, `pack.name`, and `pack.family` are required.
- `version.number` must be greater than zero.
- `version.execution_mode` should be explicit: use `prompt_eval` or `native`.
- `version.evaluation_spec.validators` must contain at least one validator.
- Every challenge needs `key`, `title`, `category`, and `difficulty`.
- `difficulty` must be `easy`, `medium`, `hard`, or `expert`.
- Every input set needs `key`, `name`, and at least one `cases` entry.
- Every case needs `challenge_key` referencing a declared challenge and a stable `case_key`.
- Use `cases`, not legacy `items`, in new YAML.

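The reference rules in this list can be illustrated with a small fragment. This is only a sketch: the challenge keys, case keys, and payload text are hypothetical, and the fragment omits the `pack` and `version` blocks that a complete file also needs. It shows two challenges and two input sets in which every `challenge_key` matches a declared challenge `key`.

```yaml
# Hypothetical fragment: keys and payload text are made up for illustration.
challenges:
  - key: refund-question
    title: Refund Policy Question
    category: support
    difficulty: easy          # must be easy, medium, hard, or expert
  - key: escalation-triage
    title: Escalation Triage
    category: support
    difficulty: hard

input_sets:
  - key: smoke
    name: Smoke
    cases:
      - challenge_key: refund-question    # matches a declared challenge key
        case_key: basic-refund            # stable identifier for this case
        payload:
          customer_message: Can I get a refund after 14 days?
  - key: full
    name: Full
    cases:
      - challenge_key: refund-question
        case_key: late-refund
        payload:
          customer_message: My order arrived damaged after 30 days.
      - challenge_key: escalation-triage
        case_key: manager-request
        payload:
          customer_message: I want to speak to a manager now.
```
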
## Execution Modes
Use `prompt_eval` for prompt-style tasks where the agent only needs the prompt and final output.

`prompt_eval` cannot use:
- top-level `tools`
- `version.tool_policy`
- `version.sandbox`

Use `native` when the challenge needs files, tools, network policy, package installation, file validators, directory checks, code execution, or sandbox behavior.

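To make the restriction concrete, the two fragments below contrast a `version` block that this rule is expected to reject with one that satisfies it. Both are partial sketches: they leave out `evaluation_spec` and everything outside `version`, and the sandbox value is illustrative.

```yaml
# Expected to fail validation: prompt_eval combined with a sandbox block.
version:
  number: 1
  execution_mode: prompt_eval
  sandbox:
    network_access: true
---
# Satisfies this rule: sandbox settings paired with the native execution mode.
version:
  number: 1
  execution_mode: native
  sandbox:
    network_access: true
```
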
## Native Tools And Sandbox
Only include these blocks for `native` packs.

```yaml
tools:
  custom:
    - name: lookup_order
      description: Looks up an order by ID.
      parameters:
        type: object
        properties:
          order_id:
            type: string
        required:
          - order_id
      implementation:
        primitive: http_request
        args:
          method: GET
          url: "https://example.test/orders/${order_id}"
          headers:
            Authorization: "Bearer ${secrets.ORDER_API_KEY}"

version:
  number: 1
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - browser
      - file
      - network
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24
    env_vars:
      DATASET_MODE: fixture
    additional_packages:
      - jq
```

Supported `version.tool_policy.allowed_tool_kinds` values are exactly `browser`, `build`, `data`, `file`, and `network`. Do not use `shell` as an allowed tool kind.

For `tools.custom`, each tool needs `name`, `parameters`, and `implementation`. Non-`mock` implementations need `implementation.primitive` and `implementation.args`. Template placeholders in `args` use `${parameter_name}` for declared parameters and may reference `${secrets.SECRET_KEY}` when the runtime provides that secret; never paste raw secret values into YAML.

Sandbox rules:
- `network_allowlist` entries must be valid CIDR ranges.
- `env_vars` keys must look like shell env names, for example `DATASET_MODE`.
- `additional_packages` entries must be valid apt-style package names.
- Never put raw secrets in YAML. Name required secret keys in notes and configure them through the workspace/runtime/provider flow.

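As a rough illustration of the sandbox rules, the sketch below pairs values that follow each rule with commented-out values that appear to break it. The specific addresses and package names are examples only, and the commented-out entries are assumptions about what the validator rejects, based on the wording of the rules rather than on observed validator output.

```yaml
version:
  number: 1
  execution_mode: native
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24           # CIDR range
      - 198.51.100.7/32          # single host written as a /32 range
      # - api.example.test       # hostname, not a CIDR range
    env_vars:
      DATASET_MODE: fixture      # shell-style env name
      # dataset-mode: fixture    # a hyphen is not valid in a shell env name
    additional_packages:
      - jq
      - python3-pip              # apt-style package name
      # - "jq && curl"           # a command string, not a package name
```
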
## Assets, Inputs, Expectations, And Artifacts
Assets may appear on `version`, challenges, or cases. Each asset list must use unique `key` values, and every asset needs `path`.

```yaml
version:
  assets:
    - key: policy_pdf
      kind: file
      path: assets/policy.pdf
      media_type: application/pdf

challenges:
  - key: summarize-policy
    title: Summarize Policy
    category: documents
    difficulty: medium
    artifact_refs:
      - key: policy_pdf

input_sets:
  - key: full
    name: Full
    cases:
      - challenge_key: summarize-policy
        case_key: policy-summary
        inputs:
          - key: source_doc
            kind: file
            artifact_key: policy_pdf
            path: assets/policy.pdf
        expectations:
          - key: summary_requirements
            kind: text
            value: Mention refund window and exclusions.
```

Case input fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `path`.

Case expectation fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `source`. Supported `source` values are empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`, for example `source: input:prompt`.

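The three `source` forms can be placed side by side in one case. The fragment below is a sketch: it assumes a version asset with key `policy_pdf` is declared elsewhere in the file, the expectation keys are hypothetical, and `kind: file` on an expectation is an assumption carried over from the input example.

```yaml
# Fragment of an input set; assumes version.assets declares key policy_pdf.
cases:
  - challenge_key: summarize-policy
    case_key: policy-summary
    inputs:
      - key: prompt
        kind: text
        value: Summarize the refund policy.
    expectations:
      - key: literal_requirement     # source omitted: the value is given inline
        kind: text
        value: Mention refund window and exclusions.
      - key: echoes_prompt           # taken from the case input with key "prompt"
        kind: text
        source: input:prompt
      - key: reference_document      # taken from the version asset "policy_pdf"
        kind: file
        source: artifact:policy_pdf
```
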
## Evaluation Spec
`evaluation_spec` controls scoring. Keep deterministic checks deterministic, and use LLM judges only when subjective quality is genuinely needed.

```yaml
evaluation_spec:
  name: Support Eval Scoring
  version_number: 1
  judge_mode: hybrid
  validators:
    - key: has_json
      type: json_schema
      target: final_output
      expected_from: 'literal:{"type":"object","required":["answer"]}'
  llm_judges:
    - key: helpfulness
      mode: rubric
      model: gpt-4.1
      rubric: Judge whether the answer is helpful, grounded, and concise.
      context_from:
        - challenge_input
        - final_output
  scorecard:
    strategy: hybrid
    dimensions:
      - key: schema_gate
        source: validators
        validators:
          - has_json
        weight: 1
        gate: true
        pass_threshold: 1
      - key: helpfulness
        source: llm_judge
        judge_key: helpfulness
        weight: 1
```

Supported `judge_mode` values are `deterministic`, `llm_judge`, and `hybrid`.

Supported validator types include `exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`, `tool_call_assertion`, and `postcondition`.

Evidence references accepted by validators and judges include `final_output`, `run.final_output`, `challenge_input`, `case.payload`, `case.payload.<path>`, `case.inputs.<path>`, `case.expectations.<path>`, `artifact.<path>`, `file:<post_execution_check_key>`, and `literal:<value>`.

File validators require a `file:` target. `code_execution` validators must target a `post_execution_checks` entry of type `file_capture`. `tool_call_assertion` validators must target `tool_calls` and omit `expected_from`. `postcondition` validators target `file:<post_execution_check_key>`, omit `expected_from`, and use `config.condition` for declarative post-run checks.

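The targeting rules for file and tool-call validators might look like the fragment below in a `native` pack. Treat it as a loose sketch: `report_capture` is a hypothetical `post_execution_checks` key whose declaration is not shown, and any `config` these validator types may require is left out because the text above does not specify it.

```yaml
# Fragment of evaluation_spec for a native pack; keys are hypothetical.
validators:
  - key: report_written
    type: file_exists
    target: file:report_capture       # file validators need a file: target
  - key: report_mentions_total
    type: file_content_match
    target: file:report_capture
    expected_from: literal:Total
  - key: looked_up_order
    type: tool_call_assertion
    target: tool_calls                # expected_from is omitted for this type
```
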
For `judge_mode: deterministic`, omit `llm_judges`. For `judge_mode: llm_judge`, include at least one judge. For `judge_mode: hybrid`, include validators and at least one judge; hybrid scorecards need a gated dimension.

For scorecard dimensions with `source: validators`, omit the `validators` list only when the dimension should score every validator. Add `validators: [<validator_key>]` when the dimension should cover a specific subset.

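The difference between scoring every validator and scoring a subset can be sketched with two dimensions in one deterministic spec. The validator keys, dimension keys, and literal strings here are hypothetical; the field names follow the examples earlier in this section.

```yaml
evaluation_spec:
  name: Support Eval Scoring
  version_number: 1
  judge_mode: deterministic          # no llm_judges block in this mode
  validators:
    - key: mentions_refund_policy
      type: contains
      target: final_output
      expected_from: literal:refund policy
    - key: mentions_time_window
      type: contains
      target: final_output
      expected_from: literal:14 days
  scorecard:
    strategy: weighted
    dimensions:
      - key: overall                 # no validators list: scores every validator
        source: validators
        weight: 1
      - key: policy_only             # scores only the listed subset
        source: validators
        validators:
          - mentions_refund_policy
        weight: 1
```
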
Do not put `${secrets.*}` references in LLM judge `rubric`, `assertion`, or `prompt` text. Secrets are allowed in native tool implementation args when the runtime provides them, but judge prompt text rejects secret references.

## Authoring Procedure
1. Start with `agentclash challenge-pack init ... --template prompt_eval` or `--template native`.
2. Fill `pack` metadata with stable slug/name/family.
3. Set `version.number: 1` for a new pack and choose `execution_mode`.
4. Write challenges before cases so every `case.challenge_key` can reference a real challenge.
5. Add input sets by run purpose: `smoke`, `ci`, `regression`, `full`, or similar.
6. Add deterministic validators first; add LLM judges only when deterministic evidence cannot capture quality.
7. Add native-only tools, sandbox, files, assets, and artifact refs only when the execution mode is `native`.
8. Run validation with and without `--json`.
9. Hand off to publication only after validation passes.

## Common Validation Failures
- Missing `pack.family`, `challenge.category`, or `case_key`.
- Case `challenge_key` does not match any challenge `key`.
- `difficulty` is not one of `easy`, `medium`, `hard`, or `expert`.
- A `prompt_eval` pack includes `tools`, `tool_policy`, or `sandbox`.
- `allowed_tool_kinds` contains unsupported values such as `shell`.
- Asset or artifact reference keys are missing, duplicated, or point at undeclared version assets.
- Case expectation `source` is not empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.
- File validators do not use a `file:` evidence target.
- `judge_mode` conflicts with the presence or absence of `llm_judges`.

## Report Back Format
```text
YAML file:
Execution mode:
Challenges:
Input sets:
Scoring mode:
Native tools/sandbox/assets:
Validation command:
Validation result:
Ready for publish: <yes/no>
Next skill: agentclash-challenge-pack-validation-publish
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-validation-publish`