Challenge packs
Input sets & cases
How cases bind challenges, structured inputs, expectations, assets, and legacy payloads—grounded in challengepack.CaseDefinition.
Input sets are the unit AgentClash schedules per deployment/candidate. Each input_sets[] entry contains a list of cases[] — the individual tasks an agent runs (defined by CaseDefinition in backend/internal/challengepack/bundle.go).
Case identity
- challenge_key: must reference an existing challenges[].key
- case_key / legacy item_key: both accepted; normalization copies whichever one is missing from the other
All cases in one input_sets[] entry must reference the same challenge_key; split mixed-challenge suites into separate input sets.
When both keys are present, case_key is the one stored and used to identify the row (item_key is the fallback). Internally this is EffectiveKey().
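As an illustration of the identity rules above, the fragment below sketches a suite split into two input sets, one per challenge, with one case keyed by case_key and another keyed only by the legacy item_key. The second challenge key, the input set keys, and all case names other than ambiguous-sev1 are hypothetical, and the payloads are trimmed to a single field.

```yaml
input_sets:
  - key: payments-outage            # hypothetical input set key
    name: Payments outage cases
    cases:
      # Both cases share one challenge_key, as required within an input set.
      - challenge_key: payments-api-outage
        case_key: ambiguous-sev1
        payload:
          incident_summary: "Elevated 5xx errors in one region."
      - challenge_key: payments-api-outage
        item_key: clear-sev2        # legacy key only; normalization copies it to case_key
        payload:
          incident_summary: "Single host reporting disk pressure."
  - key: checkout-latency           # separate set for a different challenge
    name: Checkout latency cases
    cases:
      - challenge_key: checkout-latency-regression   # hypothetical challenge key
        case_key: p95-regression
        payload:
          incident_summary: "Checkout p95 latency doubled after a deploy."
```

Keeping each challenge in its own input set avoids the mixed-challenge case that the rule above rules out.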
Three authoring styles (coexist)
1. Legacy payload-only: fill the payload map; omit structured inputs/expectations
2. Structured eval: inputs[] + expectations[] with explicit kind fields
3. Artifact heavy: assets[] + artifacts[] referencing declared version/challenge assets
A case counts as style (1) when it has no inputs, expectations, artifacts, assets, or user_simulator — just a raw payload. Such cases are stored as the bare payload blob for backward compatibility (IsLegacyPayloadOnly).
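The three styles might look like this side by side. This is a sketch, not a copy of a shipped pack: the case keys, asset key, and paths are hypothetical, and it assumes that an input's artifact_key names an entry in assets[]. The artifacts[] list is left out because its entry shape is not described on this page.

```yaml
cases:
  # 1. Legacy payload-only: nothing but a raw payload map.
  - challenge_key: payments-api-outage
    case_key: legacy-style
    payload:
      incident_summary: "Elevated 5xx errors in one region."

  # 2. Structured eval: inputs[] and expectations[] with explicit kind fields.
  - challenge_key: payments-api-outage
    case_key: structured-style
    inputs:
      - key: incident_summary
        kind: text
        value: "Elevated 5xx errors in one region."
    expectations:
      - key: escalation_policy
        kind: text
        value: "Escalate immediately."

  # 3. Artifact heavy: assets[] plus an input that pulls bytes by artifact_key.
  - challenge_key: payments-api-outage
    case_key: artifact-style
    assets:
      - key: incident-logs          # hypothetical asset key
        path: logs/incident.log     # hypothetical path
    inputs:
      - key: logs
        kind: artifact
        artifact_key: incident-logs # assumed to match assets[].key above
```

Only the first case would be stored as a bare payload blob; the other two carry modern fields.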
Stored document shape
When any modern field is present, the case is stored as a StoredCaseDocument JSON object tagged schema_version: 1 (StoredPayload()), preserving:
- payload
- inputs
- expectations
- artifacts
- assets
This is what scoring + replay pull back—not the raw YAML fragment.
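To make the stored form concrete, here is a rough sketch of what such a document could contain for a case with a payload and one expectation. It is written in YAML for readability, although the stored form is a JSON object. The field names are assumed to match the list above, and whether empty fields are omitted or written as empty lists is not stated here.

```yaml
# Assumed shape of a stored case document (the stored form is JSON).
schema_version: 1
payload:
  incident_summary: "Elevated 5xx errors in one region."
expectations:
  - key: escalation_policy
    kind: text
    value: "Escalate immediately."
# inputs, artifacts, and assets would appear alongside when the case defines them.
```

A legacy payload-only case, by contrast, would be stored as just the contents of payload, with no schema_version wrapper.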
Example input set
A single input set with one case that combines a payload blob with expectations (lifted from examples/challenge-packs/incident-response-llm-judge.yaml):
```yaml
input_sets:
  - key: default
    name: Default Input Set
    cases:
      - challenge_key: payments-api-outage
        case_key: ambiguous-sev1
        payload:
          incident_summary: |
            The payments API is returning elevated 5xx errors in one region.
          signals:
            - "5xx error rate increased from 0.4% to 18% in us-east-1"
            - "database p95 latency increased from 40ms to 480ms"
        expectations:
          - key: escalation_policy
            kind: text
            value: |
              Escalate immediately for potentially high-severity, multi-signal
              incidents with uncertain root cause.
```
Case inputs (inputs[])
Each entry in inputs[] carries these fields (CaseInput):
| Field | Role |
|---|---|
| key | Stable id for templates / UI |
| kind | Drives rendering + validator binding (text, artifact, etc.; product-specific kinds should match worker expectations) |
| value | Inline scalar/object |
| artifact_key | Pull bytes from declared asset map |
| path | Optional relative path inside asset bundle |
Validators can address values through case.inputs.<key> evidence paths.
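A minimal sketch of an inputs[] list using the fields in the table: one inline scalar, one inline object, and one artifact-backed input. The json kind, the asset key, and the path are hypothetical; check which kinds your workers actually accept.

```yaml
inputs:
  - key: prompt                     # addressable as case.inputs.prompt
    kind: text
    value: "Triage the incident and decide whether to escalate."
  - key: thresholds                 # inline object value
    kind: json                      # hypothetical kind
    value:
      error_rate_pct: 18
      p95_latency_ms: 480
  - key: runbook
    kind: artifact
    artifact_key: oncall-runbook    # hypothetical key in the declared asset map
    path: docs/escalation.md        # hypothetical relative path inside the asset bundle
```

Because key is what evidence paths resolve against, renaming it after validators are written would likely break those bindings.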
Expectations (expectations[])
Each expectations[] entry mirrors an input (CaseExpectation):
- key, kind, value, artifact_key, plus source telling graders where dynamic gold values originate (input:prompt pattern seen in CLI template packs)
Use expectations for:
- deterministic string compares
- supplying LLM judge reference_from bindings
- filesystem validators comparing outputs to expected files
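The uses above could be expressed roughly as follows. The keys and the asset key are hypothetical, and reading input:prompt as "take the gold value from the input keyed prompt" is an assumption based on the pattern's name. The validator or judge configuration that consumes these expectations lives elsewhere in the pack and is not shown.

```yaml
expectations:
  # Fixed gold value for a deterministic string compare.
  - key: severity
    kind: text
    value: "sev1"
  # Dynamic gold value: source points graders at where the value originates.
  - key: reference_answer
    kind: text
    source: "input:prompt"
  # Expected file for a filesystem validator to compare outputs against.
  - key: expected_report
    kind: artifact
    artifact_key: expected-report   # hypothetical asset key
```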
Assets on cases
Case-level assets[] references use the same AssetReference structure as version-level entries (key, path, optional artifact_id). Validation ensures cross-references exist before publish succeeds.
Input set metadata
Optional description on an input set is preserved for UI/discovery; there is no behavioral magic—selection happens by id/key at run creation time.
Choosing input set at run time
On the CLI, eval start accepts --input-set when multiple sets exist; without the flag, interactive (TTY) sessions prompt for a choice. API consumers pass the chosen challenge_input_set_id when creating runs (see OpenAPI CreateRun family). If it is omitted and the pack has exactly one input set, that set is auto-selected; if the pack has multiple, the request is rejected.