Challenge packs

Bundle YAML reference

Structured reference for challenge-pack bundles as parsed by backend/internal/challengepack.

AgentClash challenge packs ship as one YAML file decoded into challengepack.Bundle (backend/internal/challengepack/bundle.go). Publication stores a JSON manifest combining the pack body with metadata for execution and scoring (ManifestJSON in the same file).

Top-level keys

All keys listed here are persisted or validated somewhere in publish/validate—not decorative.

`pack` (required)

Human metadata surfaced in API and CLI lists:

slug — required, workspace-unique branding string
name — required display name
family — required grouping/category string
description — optional

`version` (required)

The executable spine of the bundle:

Field	Notes
`number`	Required positive integer (`int32`). Bumps when you materially change validators, tooling, sandbox, scores, etc.
`execution_mode`	One of `native`, `prompt_eval`, `responses`, or `multi_turn`. An empty value is accepted as legacy, but always set it explicitly. See execution mode compatibility below.
`tool_policy`	Arbitrary-shaped map mirrored into manifest `tool_policy`. Must be empty when `execution_mode` is `prompt_eval` (enforced when the bundle is validated).
`filesystem`	Optional filesystem constraints blob (same manifest path as today).
`sandbox`	Optional `SandboxConfig`. Forbidden when `execution_mode` is `prompt_eval`.
`evaluation_spec`	Required scoring contract unmarshalled through scoring package strict decode (`scoring.StrictDecodeEvaluationSpec`). Typos surface at parse time rather than silently defaulting.
`assets`	Version-scoped `AssetReference[]` keyed for cases and uploads.

SandboxConfig (bundle.go) currently supports:

network_access (bool)
network_allowlist (CIDR strings; invalid CIDR rejects publish)
env_vars (map of string literals—see Sandbox & E2B)
additional_packages (APT-style names constrained by regexp in validation)
sandbox_template_id (optional provider template override)

Manifest JSON merges sandbox_template_id from version.sandbox into the serialized version block for backends that historically keyed off that field separately.

`tools` (optional)

Optional map keyed by integration style; authoring today uses tools.custom as an array of composed tools. Must be empty when execution_mode is prompt_eval or responses.

See Tools, primitives & policy.

`challenges` (required, non-empty)

Each challenge includes:

Field	Required	Purpose
`key`	yes	Stable id referenced by cases
`title`	yes	Display
`category`	yes	Stored metadata
`difficulty`	yes	Must be one of `easy`, `medium`, `hard`, or `expert` (any other value fails publish with "must be one of easy, medium, hard, expert")
`instructions`	often	Prompt body; mirrored into `definition.instructions` if missing there
`definition`	optional	Extra JSON-compatible bag for product-specific authoring
`assets`	optional	Challenge-scoped files
`artifact_refs`	optional	Artifact key references validated against declared artifacts

`input_sets` (required)

Defines runnable cases:

key and name required on each set.
Prefer modern cases array. Legacy items aliases are normalized into cases during normalizeBundle; do not rely on items in new authoring.
All cases in a single input set must reference the same challenge_key; use separate input sets for separate challenges.

Cases are modeled by CaseDefinition with challenge_key, case_key, optional rich inputs/expectations, artifacts, assets, plus legacy payload for blob-only authoring.

Deep dive: Input sets & cases.

Execution mode compatibility

execution_mode accepts four values; each constrains which other blocks may appear (enforced when the bundle is validated):

native — the full runtime. You may populate version.sandbox, allowed tool kinds, and composed tools; the worker hydrates the richer runtime path. Choose this when you rely on sandbox files, primitives, or validators that read captured files (file: evidence).
prompt_eval — pure model-output workloads. The tools block must not be present, version.sandbox must not be present, and version.tool_policy must be empty.
responses — OpenAI Responses-API workloads. The tools block must be empty (use OpenAI Responses tools, not pack-defined tools).
multi_turn — scripted or simulated multi-turn conversations. See Multi-turn packs for the per-case user_simulator manifest.

Choose prompt_eval for pure model-output workloads; promote to native when you need sandbox files, primitives, or file-based evidence.

After publish — IDs returned

Publishing returns stable identifiers referenced by runs (see authoring guide):

challenge_pack_id
challenge_pack_version_id
evaluation_spec_id
input_set_ids
Optional bundle_artifact_id

Run creation binds challenge_pack_version_id specifically—YAML filenames stop mattering immediately after publish.

Top-level keys

pack (required)

version (required)

tools (optional)

challenges (required, non-empty)

input_sets (required)