Challenge Pack Planner Skill

Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.

Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner/SKILL.md

Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner

Full SKILL.md

---
name: agentclash-challenge-pack-planner
description: Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
metadata:
  agentclash.role: challenge-pack-planning
  agentclash.version: "1"
  agentclash.requires_cli: "false"
---

# AgentClash Challenge Pack Planner

## Purpose
Turn an eval idea into a concrete challenge-pack plan before anyone writes YAML.

Use this skill to produce the planning artifact that downstream skills can convert into an AgentClash challenge pack. Planning does not require CLI access, but the plan must match the current challenge-pack model: `pack`, `version`, optional top-level `tools`, `challenges`, and `input_sets`.

## Use When
- A user describes a benchmark, regression idea, eval suite, or "can we test this agent?" scenario.
- The workload needs boundaries before YAML authoring: what behavior is tested, what cases exist, what counts as success, and which tools/files are allowed.
- A coding agent needs enough AgentClash product context to plan a pack without reading the AgentClash source repo.
- You need a handoff to `agentclash-challenge-pack-yaml-author`, `agentclash-challenge-pack-input-sets`, `agentclash-challenge-pack-scoring-validators`, or `agentclash-challenge-pack-llm-judges`.

## Do Not Use When
- The user already has a finished plan and wants valid YAML; use `agentclash-challenge-pack-yaml-author`.
- The task is only to validate or publish a YAML file; use `agentclash-challenge-pack-validation-publish`.
- The task is to choose deployments or start runs; use `agentclash-agent-deployment-setup` or `agentclash-eval-runner`.
- The user needs CLI installation, auth, workspace linking, or hosted setup; use `agentclash-cli-setup`.

## Inputs Needed
- Evaluation goal: the behavior, capability, or failure mode the pack should expose.
- Target agent class: coding agent, support bot, research agent, workflow agent, extraction agent, etc.
- Good, bad, and borderline outputs.
- Expected evidence: final text, JSON fields, captured files/directories, artifacts, metrics, latency, cost, or judge rationale.
- Case inventory: representative, edge, adversarial, regression, and smoke examples.
- Execution needs: `prompt_eval` versus `native`, tools, network, files, packages, secrets, and time budget.
- Release intent: exploration, regression suite, CI gate, public comparison, or customer demo.

## Planning Procedure
1. State the pack boundary in one sentence: what is being tested and what is explicitly out of scope.
2. Choose execution mode:
   - `prompt_eval` for prompt-style tasks that do not need pack-defined tools, sandbox config, or native file/tool execution.
   - `native` when the agent must use files, tools, sandbox policy, network, packages, artifacts, or code/file validators.
3. Define one or more `challenges`. Each challenge should have a stable `key`, title, category, difficulty (`easy`, `medium`, `hard`, or `expert`), and instructions.
4. Design cases before scoring. For each case, define a stable `case_key`, the `challenge_key` it targets, concrete inputs, expected outputs or expectations, and why the case exists.
5. Group cases into `input_sets`: at minimum `smoke` or `default`; add `full`, `regression`, or `ci` only when their purpose and budget differ. Each input set must contain cases for a single `challenge_key`; split mixed-challenge suites into separate input sets.
6. Pick evidence sources. Decide whether success is visible in `final_output`, structured JSON, files, artifacts, tool behavior, metrics, or LLM-judge rationale.
7. Choose scoring:
   - deterministic validators for exact, regex, JSON, numeric, token overlap, math, file, directory, or code-execution checks.
   - LLM judges for subjective quality where deterministic checks cannot honestly capture the behavior.
   - hybrid scoring when hard gates and qualitative judgment both matter.
8. Decide runtime policy only if needed: allowed tool kinds, sandbox network access, package needs, file assets, and secrets. Keep the policy as narrow as the workload allows.
9. Define publish criteria: what must be true before the pack can be validated, published, and used in an eval run.
10. Produce a handoff plan naming the next skill and the missing information, if any.

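Step 5's rule that each input set holds cases for a single `challenge_key` can be checked mechanically while the plan is still a draft. The sketch below is a hypothetical planning helper, not an AgentClash API. It assumes the draft plan is held as plain Python dictionaries whose `key`, `cases`, `case_key`, and `challenge_key` fields mirror the nouns used in this skill. The `<set>-<challenge>` naming for split sets and the sample keys are illustrative assumptions.

```python
from collections import defaultdict


def split_mixed_input_sets(input_sets):
    """Return input sets in which every set targets exactly one challenge_key.

    Hypothetical planning helper, not an AgentClash API. Each input set is a
    dict like {"key": str, "cases": [{"case_key": str, "challenge_key": str}]}.
    """
    result = []
    for input_set in input_sets:
        # Group this set's cases by the challenge they target.
        by_challenge = defaultdict(list)
        for case in input_set["cases"]:
            by_challenge[case["challenge_key"]].append(case)

        if len(by_challenge) <= 1:
            # Already single-challenge (or empty): keep it unchanged.
            result.append(input_set)
            continue

        # Mixed set: emit one set per challenge. The "<set>-<challenge>"
        # naming is an assumption made for this sketch.
        for challenge_key, cases in by_challenge.items():
            result.append({"key": f"{input_set['key']}-{challenge_key}", "cases": cases})
    return result


# Illustrative draft: one smoke set that mixes two challenges.
draft_input_sets = [
    {
        "key": "smoke",
        "cases": [
            {"case_key": "refund-basic", "challenge_key": "refund-policy"},
            {"case_key": "escalate-basic", "challenge_key": "escalation"},
        ],
    },
]

for input_set in split_mixed_input_sets(draft_input_sets):
    print(input_set["key"], [case["case_key"] for case in input_set["cases"]])
```

Run against the draft above, the mixed `smoke` set is split into `smoke-refund-policy` and `smoke-escalation`, each with one case. Whether the split sets keep the original purpose and budget is still a planning decision.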
## Challenge Pack Model
Use these product nouns consistently:

- `pack`: human metadata such as `slug`, `name`, `family`, and optional description.
- `version`: executable version data: `number`, `execution_mode`, `evaluation_spec`, and optional `tool_policy`, `filesystem`, `sandbox`, and `assets`.
- `tools`: optional top-level pack-defined composed tools. Do not plan these for `prompt_eval`.
- `challenges`: task definitions. Cases reference them by `challenge_key`.
- `input_sets`: named groups of runnable cases.
- `cases`: concrete workload items with `case_key`, `payload`, `inputs`, `expectations`, case `artifacts`, and case-local `assets` as needed.
- `evaluation_spec`: score contract with `judge_mode`, validators, optional metrics or LLM judges, runtime limits, pricing, and scorecard dimensions.

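Because cases reference challenges by `challenge_key`, a draft plan can be checked for dangling references before YAML authoring. The following is a minimal sketch under stated assumptions: the plan is a plain dictionary that reuses the nouns above, `find_unknown_challenge_refs` is a hypothetical helper rather than an AgentClash function, and every value in the sample plan (slug, names, the integer `number`) is a placeholder, not a documented format.

```python
def find_unknown_challenge_refs(plan):
    """List (input_set key, case_key, challenge_key) for undefined challenge keys.

    Hypothetical planning helper; `plan` is a dict with "challenges" and
    "input_sets" entries shaped like the sample below.
    """
    known = {challenge["key"] for challenge in plan["challenges"]}
    problems = []
    for input_set in plan["input_sets"]:
        for case in input_set["cases"]:
            if case["challenge_key"] not in known:
                problems.append((input_set["key"], case["case_key"], case["challenge_key"]))
    return problems


# Placeholder plan: all values are illustrative assumptions.
plan = {
    "pack": {"slug": "support-triage", "name": "Support Triage", "family": "support"},
    "version": {"number": 1, "execution_mode": "prompt_eval"},
    "challenges": [
        {"key": "refund-policy", "title": "Refund policy answers",
         "category": "support", "difficulty": "easy"},
    ],
    "input_sets": [
        {"key": "smoke", "cases": [
            {"case_key": "refund-basic", "challenge_key": "refund-policy"},
            # Deliberate typo in challenge_key to show what the check reports.
            {"case_key": "refund-typo", "challenge_key": "refund-polcy"},
        ]},
    ],
}

print(find_unknown_challenge_refs(plan))
```

The sample reports the one case whose `challenge_key` does not match any defined challenge. An empty list means every case is tied to a challenge.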
## Planning Heuristics
- Prefer fewer, sharper challenges over a broad pack that mixes unrelated behaviors.
- Prefer small smoke sets that fail fast and full sets that measure coverage.
- Each case should teach the evaluator something unique. Duplicate cases need a reason, such as variance or regression coverage.
- Make expectations observable. "Looks good" is not enough; specify the evidence path and what a pass means.
- Use deterministic validators for hard facts, schemas, files, and code behavior. Use LLM judges for judgement calls like helpfulness, prioritization, style, tradeoff quality, or incident reasoning.
- Use `native` only when the task truly needs sandbox/files/tools. Simpler packs are easier to validate and reuse.
- Do not put raw secrets in the plan. Name the required secret keys and say they must be provided through workspace secrets or runtime/provider configuration.
- Do not invent IDs. Planning should name resources by role until validation/publish creates real IDs.

## Execution Mode Decision Table
| Need | Plan |
| --- | --- |
| Single prompt and final text answer | `prompt_eval` |
| Structured extraction from text input | `prompt_eval` unless file tooling is required |
| Agent must read/write files, run tests, or produce artifacts | `native` |
| Pack-defined custom tools | `native` |
| Network access or extra packages | `native` with explicit sandbox policy |
| Code, file, directory, or artifact-backed validators | `native` |
| Pure qualitative grading | `prompt_eval` or `native` depending on execution needs, plus `llm_judge` |

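The decision table reduces to one rule: any need that requires files, tools, sandbox policy, network, packages, artifacts, or file-backed validators moves the plan to `native`; otherwise `prompt_eval` is enough. A minimal sketch of that rule follows. The need labels in `NATIVE_NEEDS` and the function name are hypothetical names chosen for this example, not AgentClash identifiers.

```python
# Hypothetical labels for needs that, per the decision table, require `native`.
NATIVE_NEEDS = {
    "files",            # agent must read or write files
    "run_tests",        # agent must run tests
    "artifacts",        # agent must produce artifacts
    "custom_tools",     # pack-defined custom tools
    "network",          # outbound network access
    "extra_packages",   # additional sandbox packages
    "file_validators",  # code, file, directory, or artifact-backed validators
}


def choose_execution_mode(needs):
    """Return (mode, reasons) where reasons lists the needs that force `native`."""
    native_reasons = sorted(set(needs) & NATIVE_NEEDS)
    mode = "native" if native_reasons else "prompt_eval"
    return mode, native_reasons


print(choose_execution_mode({"final_text"}))
print(choose_execution_mode(["network", "files", "final_text"]))
```

Returning the reasons alongside the mode keeps the plan honest: a `native` choice with an empty reason list is a sign the pack could be simplified to `prompt_eval`.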
## Scoring Plan Shape
For each scoring dimension, specify:

```text
Dimension: <stable key>
Source: validators | metric | reliability | latency | cost | behavioral | llm_judge
Evidence: <final_output | file:path | artifact key | metric collector | judge key>
Pass rule: <threshold, gate, or qualitative rubric>
Failure message: <what should be reported when this fails>
```

Use `judge_mode: deterministic` when validators and metrics are sufficient. Use `judge_mode: llm_judge` when judges are the main grading surface. Use `judge_mode: hybrid` when deterministic gates and LLM-judge dimensions both matter.

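As a rough first pass, the `judge_mode` can be derived from the `Source` values of the planned dimensions. The sketch below is a simplification and an assumption, not AgentClash behavior: it treats every source other than `llm_judge` as deterministic and picks `hybrid` whenever both kinds are present. The planner should still confirm that both kinds actually matter for the grade.

```python
def choose_judge_mode(dimension_sources):
    """Suggest a judge_mode from the Source values of the scoring dimensions.

    Hypothetical helper. Assumes any source other than "llm_judge"
    (validators, metric, reliability, latency, cost, behavioral) is deterministic.
    """
    sources = set(dimension_sources)
    if not sources:
        raise ValueError("a scoring plan needs at least one dimension")
    has_judge = "llm_judge" in sources
    has_deterministic = bool(sources - {"llm_judge"})
    if has_judge and has_deterministic:
        return "hybrid"
    return "llm_judge" if has_judge else "deterministic"


print(choose_judge_mode(["validators", "latency"]))
print(choose_judge_mode(["llm_judge"]))
print(choose_judge_mode(["validators", "llm_judge"]))
```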
## Case Coverage Checklist
- Happy path: the most ordinary success case.
- Edge case: unusual but valid input.
- Negative or guardrail case: input that should be rejected, abstained from, or handled safely.
- Ambiguity case: forces prioritization or asks for clarification when appropriate.
- Regression case: a known previous failure, with the evidence that should prevent recurrence.
- Budget case: confirms the pack can run within intended time, tool, and cost limits.

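The checklist can be applied to a draft case inventory by tagging each case with the kind of coverage it provides and listing the kinds still missing. This is a sketch under assumptions: the `coverage` tag and the kind labels are planning-only annotations invented for this example, not challenge-pack fields.

```python
# Hypothetical labels, one per checklist item above.
COVERAGE_KINDS = ["happy_path", "edge", "negative", "ambiguity", "regression", "budget"]


def missing_coverage(cases):
    """Return checklist kinds that no planned case covers, in checklist order."""
    covered = {case.get("coverage") for case in cases}
    return [kind for kind in COVERAGE_KINDS if kind not in covered]


# Illustrative draft inventory.
draft_cases = [
    {"case_key": "refund-basic", "coverage": "happy_path"},
    {"case_key": "refund-expired-window", "coverage": "edge"},
    {"case_key": "refund-prompt-injection", "coverage": "negative"},
]

print(missing_coverage(draft_cases))
```

Not every pack needs all six kinds. A missing kind is a prompt to decide deliberately, and to record the reason in the plan if it is skipped.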
## Tool, Sandbox, And Artifact Planning
Only include these when the workload needs them.

- Allowed tool kinds in `version.tool_policy.allowed_tool_kinds` must use supported broad kinds such as `browser`, `build`, `data`, `file`, and `network`.
- `version.sandbox.network_access` should stay false unless the task needs outbound network.
- `version.sandbox.network_allowlist` should be specific when network is needed.
- `version.sandbox.additional_packages` should name only packages required by the workload or validators.
- Version, challenge, and case assets should have stable `key` and `path`; artifact-backed assets also need an `artifact_id` after upload.
- Case expectations can use `value`, `artifact_key`, or `source`. Supported `source` values are empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.

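Two of these rules are easy to lint in a draft plan: the shape of an expectation's `source` value and the consistency of the sandbox network settings. The helpers below are hypothetical and rest on assumptions not stated in this skill: that the key after `input:` or `artifact:` must be non-empty, that `network_access` is a boolean, and that `network_allowlist` is a list of strings.

```python
def is_supported_source(source):
    """True for "", "input:<key>", or "artifact:<key>" with a non-empty key."""
    if source == "":
        return True
    prefix, sep, key = source.partition(":")
    return sep == ":" and prefix in {"input", "artifact"} and key != ""


def sandbox_warnings(sandbox):
    """Return planning warnings for a draft `version.sandbox` dict."""
    warnings = []
    network_access = sandbox.get("network_access", False)
    allowlist = sandbox.get("network_allowlist", [])
    if network_access and not allowlist:
        warnings.append("network_access is true but network_allowlist is empty")
    if allowlist and not network_access:
        warnings.append("network_allowlist is set but network_access is false")
    return warnings


print(is_supported_source("input:ticket_text"))
print(is_supported_source("file:notes.txt"))
print(sandbox_warnings({"network_access": True}))
```

Both checks only flag plan-level inconsistencies. Actual validation of a pack remains the job of the validation and publish step.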
## Output Format
```text
Pack name:
Slug/family:
Goal:
Out of scope:
Target agent:
Execution mode: <prompt_eval | native>

Challenges:
- key:
  title:
  category:
  difficulty:
  instructions summary:

Input sets:
- key:
  purpose:
  cases:
    - case_key:
      challenge_key:
      inputs:
      expectations:
      reason:

Scoring:
- dimension:
  judge mode:
  validators:
  llm judges:
  gates/thresholds:
  evidence:

Runtime policy:
Tools:
Assets/artifacts:
Secrets:
Publish criteria:
Risks/blockers:
Next skill:
```

## Failure Modes
- The plan has no concrete cases: ask for examples or create explicit draft cases from the user's scenario.
- Cases are not tied to a `challenge_key`: add the missing challenge structure before YAML authoring.
- The plan says `prompt_eval` but needs tools, sandbox, files, or network: switch to `native`.
- The scoring is subjective but only uses exact validators: add an LLM judge or rewrite the expected output into deterministic evidence.
- The scoring is objective but only uses LLM judges: replace with validators where possible.
- Input sets mix smoke, regression, and full benchmark cases without purpose: split them by run intent.
- The plan depends on secrets or private data: name secret keys and artifact roles, not raw values.

## Report Back Format
```text
Planned pack: <name>
Execution mode: <prompt_eval | native>
Challenge count: <n>
Case count: <n>
Input sets: <keys>
Scoring mode: <deterministic | llm_judge | hybrid>
Needs tools/sandbox: <yes/no + why>
Needs assets/artifacts: <yes/no + what>
Needs secrets: <yes/no + names only>
Ready for YAML authoring: <yes/no>
Next skill: <agentclash-challenge-pack-yaml-author | other>
Open questions: <blocking details>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`

## Related Docs
- `/docs-md/concepts/challenge-packs-and-inputs`
- `/docs-md/guides/write-a-challenge-pack`
- `/docs-md/concepts/tools-network-and-secrets`
- `/docs-md/concepts/artifacts`
- `/docs-md/reference/cli`