Challenge packs

Input sets & cases

How cases bind challenges, structured inputs, expectations, assets, and legacy payloads—grounded in challengepack.CaseDefinition.

Input sets are the unit AgentClash schedules per deployment/candidate. Each input_sets[] entry contains a list of cases[] — the individual tasks an agent runs (defined by CaseDefinition in backend/internal/challengepack/bundle.go).

Case identity

  • challenge_key: must reference an existing challenges[].key
  • case_key / legacy item_key: both are accepted; normalization copies whichever one is present into the missing one

All cases in one input_sets[] entry must reference the same challenge_key; split mixed-challenge suites into separate input sets.

When both keys are present, case_key is the one stored and used to identify the row (item_key is the fallback). Internally this is EffectiveKey().
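As an illustration of the identity rules above, the fragment below sketches one input set whose cases use the two key spellings. It assumes the pack declares a challenge with key payments-api-outage; the input set key, the second case's key, and the payload text are made up for the example.

```yaml
input_sets:
  - key: identity-demo                       # hypothetical input set key
    name: Identity Demo
    cases:
      # Modern spelling: case_key identifies the case.
      - challenge_key: payments-api-outage
        case_key: ambiguous-sev1
        payload:
          incident_summary: "Elevated 5xx errors in one region."
      # Legacy spelling: only item_key is given. Normalization copies it
      # into case_key, so the effective key is "db-latency-spike".
      - challenge_key: payments-api-outage   # same challenge as the case above
        item_key: db-latency-spike           # hypothetical legacy key
        payload:
          incident_summary: "Database p95 latency increased tenfold."
```

Both cases point at the same challenge_key, so they can live in one input set; a case for a different challenge would need its own input_sets[] entry.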

Three authoring styles (coexist)

  1. Legacy payload-only: fill the payload map; omit structured inputs/expectations
  2. Structured eval: inputs[] + expectations[] with explicit kind fields
  3. Artifact heavy: assets[] + artifacts[] referencing declared version/challenge assets

A case counts as style (1) when it has no inputs, expectations, artifacts, assets, or user_simulator — just a raw payload. Such cases are stored as the bare payload blob for backward compatibility (IsLegacyPayloadOnly).
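To make the three styles concrete, here is a minimal sketch of a cases[] list with one case per style. All case keys, input and expectation keys, asset keys, paths, and values are hypothetical; the fragment is meant to sit under an input_sets[] entry and assumes the referenced challenge and asset are declared elsewhere in the pack.

```yaml
cases:
  # 1. Legacy payload-only: nothing but a payload, stored as the bare blob.
  - challenge_key: payments-api-outage
    case_key: legacy-style
    payload:
      incident_summary: "Checkout latency doubled after a deploy."

  # 2. Structured eval: inputs[] and expectations[] with explicit kinds.
  - challenge_key: payments-api-outage
    case_key: structured-style
    inputs:
      - key: prompt
        kind: text
        value: "Decide whether this incident needs escalation."
    expectations:
      - key: decision
        kind: text
        value: "escalate"

  # 3. Artifact heavy: assets[] pointing at files the pack declares.
  #    artifacts[] is left out because its entry shape is not described here.
  - challenge_key: payments-api-outage
    case_key: artifact-style
    assets:
      - key: runbook               # hypothetical asset key
        path: assets/runbook.md    # hypothetical path
```

Only the first case would be treated as legacy payload-only; the other two carry modern fields and would therefore be stored as a StoredCaseDocument.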

Stored document shape

When any modern field is present, the case is stored as a StoredCaseDocument JSON object tagged schema_version: 1 (StoredPayload()), preserving:

  • payload
  • inputs
  • expectations
  • artifacts
  • assets

Scoring and replay read this stored document, not the raw YAML fragment.

Example input set

A single input set with one case that combines a payload blob with expectations (lifted from examples/challenge-packs/incident-response-llm-judge.yaml):

```yaml
input_sets:
  - key: default
    name: Default Input Set
    cases:
      - challenge_key: payments-api-outage
        case_key: ambiguous-sev1
        payload:
          incident_summary: |
            The payments API is returning elevated 5xx errors in one region.
          signals:
            - "5xx error rate increased from 0.4% to 18% in us-east-1"
            - "database p95 latency increased from 40ms to 480ms"
        expectations:
          - key: escalation_policy
            kind: text
            value: |
              Escalate immediately for potentially high-severity, multi-signal
              incidents with uncertain root cause.
```

Case inputs (inputs[])

Each entry in inputs[] carries these fields (CaseInput):

| Field | Role |
| --- | --- |
| key | Stable id for templates / UI |
| kind | Drives rendering + validator binding (text, artifact, etc.; product-specific kinds should match worker expectations) |
| value | Inline scalar/object |
| artifact_key | Pull bytes from declared asset map |
| path | Optional relative path inside asset bundle |

Validators can address values through case.inputs.<key> evidence paths.
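The fields above can be combined as in the following sketch, which shows one inline input and one asset-backed input. The keys, the asset key, and the path are hypothetical, and the asset-backed entry assumes an asset with that key is declared for the pack.

```yaml
inputs:
  # Inline value: addressable by validators as case.inputs.prompt
  - key: prompt
    kind: text
    value: "Summarize the incident and recommend next steps."
  # Asset-backed value: addressable as case.inputs.service_logs
  - key: service_logs              # hypothetical key
    kind: artifact
    artifact_key: incident-logs    # hypothetical; must match a declared asset key
    path: logs/api.log             # hypothetical path inside the asset bundle
```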

Expectations (expectations[])

Each expectations[] entry mirrors the shape of an input (CaseExpectation):

  • key, kind, value, and artifact_key, plus source, which tells graders where dynamic gold values originate (for example, the input:prompt pattern seen in CLI template packs)

Use expectations for:

  • deterministic string compares
  • supplying LLM judge reference_from bindings
  • filesystem validators comparing outputs to expected files
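A rough sketch of the three uses listed above follows. All keys and values are hypothetical, and the comment on source is an assumption: it reads input:prompt as "take the gold value from the case input keyed prompt", which this page does not spell out.

```yaml
expectations:
  # Static gold value for a deterministic string compare.
  - key: expected_severity
    kind: text
    value: "sev1"
  # Dynamic gold value (assumed meaning: use the input keyed "prompt"),
  # e.g. as the reference an LLM judge binds to.
  - key: reference_answer
    kind: text
    source: "input:prompt"
  # Expected file for a filesystem validator to compare outputs against.
  - key: expected_report
    kind: artifact
    artifact_key: golden-report    # hypothetical; must match a declared asset key
```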

Assets on cases

Case-level assets[] references use the same AssetReference structure as version-level entries (key, path, optional artifact_id). Validation ensures cross-references exist before publish succeeds.

Input set metadata

An optional description on an input set is preserved for UI/discovery but has no effect on behavior; selection happens by id/key at run creation time.

Choosing input set at run time

The CLI eval start command accepts --input-set to choose among multiple sets; without the flag, interactive (TTY) flows prompt for one. API consumers pass the chosen challenge_input_set_id when creating runs (see the OpenAPI CreateRun family). If it is omitted and the pack has exactly one input set, that set is auto-selected; if the pack has several, the request is rejected.

See also