Challenge pack documentation

Deep, consumer-facing YAML and runtime reference keyed to the parsers, validators, and workers in this repository.

A challenge pack is the versioned YAML bundle that defines a benchmark task: its prompts, tools, sandbox, evaluation spec, and input cases. These pages complement the short concept guide "Challenge packs and inputs". They spell out everything a benchmark author needs to publish a pack that survives server-side parsing, validation, and execution.

Everything here describes shipped code paths, not roadmap language. When behavior changes upstream, re-validate your pack with:

```bash
agentclash challenge-pack validate your-pack.yaml
```
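
If you maintain several packs, the same command can be wrapped in a loop, for example in CI. The following is a minimal sketch, not part of the shipped CLI: it assumes `agentclash challenge-pack validate` exits non-zero when validation fails, which this page does not confirm. The function name `validate_packs` and the `AGENTCLASH_BIN` override are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Validate every pack file passed as an argument and report all failures
# instead of stopping at the first one.
# ASSUMPTION: the validate command exits non-zero on a failed validation.
# AGENTCLASH_BIN is a hypothetical override (handy for stubbing in tests);
# it defaults to the `agentclash` binary on PATH.
validate_packs() {
  local bin="${AGENTCLASH_BIN:-agentclash}"
  local failed=0
  local pack
  for pack in "$@"; do
    if "$bin" challenge-pack validate "$pack"; then
      echo "ok: $pack"
    else
      echo "FAILED: $pack" >&2
      failed=1
    fi
  done
  # Return 1 if any pack failed, 0 otherwise.
  return "$failed"
}
```

Source the script and call it with one or more files, for example `validate_packs packs/*.yaml`. Collecting failures rather than exiting early means one run surfaces every broken pack.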

What's covered

| Topic | Use when you… | Anchor in repo |
| --- | --- | --- |
| Bundle YAML reference | Need the authoritative field list and `execution_mode` rules (`native`, `prompt_eval`, `responses`, `multi_turn`) | `backend/internal/challengepack/bundle.go`, `validation.go` |
| Evaluation spec reference | Choose validator types, wire `target`/`expected_from`, add metrics | `backend/internal/scoring/spec.go`, `validation.go`, `engine_*.go` |
| LLM judges | Add rubrics, assertions, pairwise comparison, budgets | `backend/internal/scoring/spec.go`, `validation_judges.go` |
| Tools, primitives & policy | Decide `allowed_tool_kinds`, map composed tools → primitives | `backend/internal/engine/primitive_tools.go`, `tool_registry.go`, `sandbox/sandbox.go` |
| Sandbox & E2B | Tune `network_allowlist`, template ID, sandbox provider | `backend/internal/challengepack/bundle.go`, `sandbox/e2b/`, worker config |
| Input sets & cases | Model fixtures, typed inputs, and expectations | `challengepack/bundle.go` (`CaseDefinition`), `StoredCaseDocument` |
| Eval workflows & gates | Chain `eval start`, baselines, scorecards, comparisons | `cli/cmd/eval.go`, `baseline.go`, `compare.go` |