Challenge packs

Bundle YAML reference

Structured reference for challenge-pack bundles as parsed by backend/internal/challengepack.

AgentClash challenge packs ship as one YAML file decoded into challengepack.Bundle (backend/internal/challengepack/bundle.go). Publication stores a JSON manifest combining the pack body with metadata for execution and scoring (ManifestJSON in the same file).

Top-level keys

All keys listed here are persisted or validated somewhere in publish/validate—not decorative.

pack (required)

Human metadata surfaced in API and CLI lists:

  • slug — required, workspace-unique branding string
  • name — required display name
  • family — required grouping/category string
  • description — optional

version (required)

The executable spine of the bundle:

FieldNotes
numberRequired positive integer (int32). Bumps when you materially change validators, tooling, sandbox, scores, etc.
execution_modeOne of native, prompt_eval, responses, or multi_turn. An empty value is accepted as legacy, but always set it explicitly. See execution mode compatibility below.
tool_policyArbitrary-shaped map mirrored into manifest tool_policy. Must be empty when execution_mode is prompt_eval (enforced when the bundle is validated).
filesystemOptional filesystem constraints blob (same manifest path as today).
sandboxOptional SandboxConfig. Forbidden when execution_mode is prompt_eval.
evaluation_specRequired scoring contract unmarshalled through scoring package strict decode (scoring.StrictDecodeEvaluationSpec). Typos surface at parse time rather than silently defaulting.
assetsVersion-scoped AssetReference[] keyed for cases and uploads.

SandboxConfig (bundle.go) currently supports:

  • network_access (bool)
  • network_allowlist (CIDR strings; invalid CIDR rejects publish)
  • env_vars (map of string literals—see Sandbox & E2B)
  • additional_packages (APT-style names constrained by regexp in validation)
  • sandbox_template_id (optional provider template override)

Manifest JSON merges sandbox_template_id from version.sandbox into the serialized version block for backends that historically keyed off that field separately.

tools (optional)

Optional map keyed by integration style; authoring today uses tools.custom as an array of composed tools. Must be empty when execution_mode is prompt_eval or responses.

See Tools, primitives & policy.

challenges (required, non-empty)

Each challenge includes:

FieldRequiredPurpose
keyyesStable id referenced by cases
titleyesDisplay
categoryyesStored metadata
difficultyyesMust be one of easy, medium, hard, or expert (any other value fails publish with "must be one of easy, medium, hard, expert")
instructionsoftenPrompt body; mirrored into definition.instructions if missing there
definitionoptionalExtra JSON-compatible bag for product-specific authoring
assetsoptionalChallenge-scoped files
artifact_refsoptionalArtifact key references validated against declared artifacts

input_sets (required)

Defines runnable cases:

  • key and name required on each set.
  • Prefer modern cases array. Legacy items aliases are normalized into cases during normalizeBundle; do not rely on items in new authoring.
  • All cases in a single input set must reference the same challenge_key; use separate input sets for separate challenges.

Cases are modeled by CaseDefinition with challenge_key, case_key, optional rich inputs/expectations, artifacts, assets, plus legacy payload for blob-only authoring.

Deep dive: Input sets & cases.

Execution mode compatibility

execution_mode accepts four values; each constrains which other blocks may appear (enforced when the bundle is validated):

  • native — the full runtime. You may populate version.sandbox, allowed tool kinds, and composed tools; the worker hydrates the richer runtime path. Choose this when you rely on sandbox files, primitives, or validators that read captured files (file: evidence).
  • prompt_eval — pure model-output workloads. The tools block must not be present, version.sandbox must not be present, and version.tool_policy must be empty.
  • responses — OpenAI Responses-API workloads. The tools block must be empty (use OpenAI Responses tools, not pack-defined tools).
  • multi_turn — scripted or simulated multi-turn conversations. See Multi-turn packs for the per-case user_simulator manifest.

Choose prompt_eval for pure model-output workloads; promote to native when you need sandbox files, primitives, or file-based evidence.

After publish — IDs returned

Publishing returns stable identifiers referenced by runs (see authoring guide):

  • challenge_pack_id
  • challenge_pack_version_id
  • evaluation_spec_id
  • input_set_ids
  • Optional bundle_artifact_id

Run creation binds challenge_pack_version_id specifically—YAML filenames stop mattering immediately after publish.

See also