Challenge pack documentation
Deep, consumer-facing YAML and runtime reference keyed to the parsers, validators, and workers in this repository.
A challenge pack is the versioned YAML bundle that defines a benchmark task: its prompts, tools, sandbox, evaluation spec, and input cases. These pages complement the short concept guide Challenge packs and inputs. They spell out everything a benchmark author needs to publish a pack that survives server-side parsing, validation, and execution.
Everything here reflects shipped code paths, not roadmap language. When behavior changes upstream, validate your pack again with:
```bash
agentclash challenge-pack validate your-pack.yaml
```

What's covered
| Topic | Use when you… | Anchor in repo |
|---|---|---|
| Bundle YAML reference | Need the authoritative field list and execution_mode rules (native, prompt_eval, responses, multi_turn) | backend/internal/challengepack/bundle.go, validation.go |
| Evaluation spec reference | Choose validator types, wire target/expected_from, add metrics | backend/internal/scoring/spec.go, validation.go, engine_*.go |
| LLM judges | Add rubrics, assertions, pairwise comparison, budgets | backend/internal/scoring/spec.go, validation_judges.go |
| Tools, primitives & policy | Decide allowed_tool_kinds, map composed tools → primitives | backend/internal/engine/primitive_tools.go, tool_registry.go, sandbox/sandbox.go |
| Sandbox & E2B | Tune network_allowlist, template id, sandbox provider | backend/internal/challengepack/bundle.go, sandbox/e2b/, worker config |
| Input sets & cases | Model fixtures, typed inputs and expectations | challengepack/bundle.go (CaseDefinition), StoredCaseDocument |
| Eval workflows & gates | Chain eval start, baselines, scorecards, comparisons | cli/cmd/eval.go, baseline.go, compare.go |
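
When a repository holds several packs, the validation command shown earlier can be wrapped in a small shell function. The following is a minimal sketch, not part of the shipped CLI: the function name `validate_packs` is hypothetical, and it assumes `agentclash challenge-pack validate` exits with a non-zero status when a pack fails validation.

```bash
#!/usr/bin/env bash
# Sketch: run the documented validate command over every .yaml file in a directory.
# Assumption: `agentclash challenge-pack validate` exits non-zero on a failed pack.
# `validate_packs` is an illustrative name, not an agentclash feature.

validate_packs() {
  local dir="${1:-.}"
  local failed=0
  local pack
  for pack in "$dir"/*.yaml; do
    # Skip the unexpanded glob pattern when the directory has no .yaml files.
    [ -e "$pack" ] || continue
    if agentclash challenge-pack validate "$pack"; then
      echo "ok: $pack"
    else
      echo "FAILED: $pack" >&2
      failed=1
    fi
  done
  # Return non-zero if any pack failed, so callers (for example CI) can gate on it.
  return "$failed"
}
```

After sourcing the function, calling `validate_packs ./packs` (where `./packs` is a placeholder directory) checks each file in turn and reports failures on stderr rather than stopping at the first one.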
See also
- Write a challenge pack — minimal happy-path checklist
- Tools, network, and secrets — mental model overview
- Sandbox layer — provider boundary explanation
- Multi-turn packs
- Security evaluation