Eval workflows & gates
CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in cli/cmd.
Challenge packs are useless until a run binds a challenge_pack_version_id to one or more deployments. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.
Happy path commands
From cli/cmd/eval.go, baseline.go, compare.go, release_gate.go:
agentclash eval start --follow
agentclash baseline set [run_id] [--agent <label>]
agentclash eval scorecard [run_id] [--agent <label>] [--json]
agentclash compare runs --baseline <run> --candidate <run>
agentclash compare gate --baseline <run> --candidate <run>
agentclash release-gate list [--baseline ... --candidate ...]
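The commands above can be chained in a script. The following is a minimal sketch, not a reference implementation: `happy_path` is a hypothetical helper name, and it assumes you already hold the two run ids (how you capture them from `eval start` output is not covered here). It uses only the subcommands and flags listed above.

```shell
#!/bin/sh
# Hypothetical helper: chains the documented commands for one baseline/candidate pair.
# $1 = baseline run id, $2 = candidate run id (both obtained elsewhere).
happy_path() {
    baseline_run=$1
    candidate_run=$2

    # Bookmark the baseline, then print the candidate scorecard as JSON.
    agentclash baseline set "$baseline_run" || return 1
    agentclash eval scorecard "$candidate_run" --json || return 1

    # Compare the pair, then evaluate the release gate for the same pair.
    agentclash compare runs --baseline "$baseline_run" --candidate "$candidate_run" || return 1
    agentclash compare gate --baseline "$baseline_run" --candidate "$candidate_run"
}
```

A call would look like `happy_path "$BASELINE_RUN" "$CANDIDATE_RUN"`, where both variables are placeholders for real run ids.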
eval start
Key flags (see eval.go / eval_resolve.go):
| Flag | Purpose |
| --- | --- |
| --pack | Pack id, slug, or exact name |
| --pack-version | Version id or integer |
| --input-set | Disambiguates when multiple input sets are published |
| --deployment | Repeatable; accepts id or exact name |
| --follow | Stream run events after creation |
| --scope | full vs suite_only regression scoping |
| --suite / --case | Target regression fixtures |
| --race-context | Peer standings injection (multi-agent) |
Non-interactive environments must supply enough disambiguators—resolver errors spell out what’s missing.
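For a non-interactive caller, one way to avoid resolver ambiguity is to pass every disambiguator explicitly. The sketch below is illustrative only: `start_eval` is a hypothetical wrapper, its argument order is an arbitrary choice, and requiring at least one deployment is a decision made by this wrapper rather than documented CLI behaviour. The flags themselves come from the table above.

```shell
#!/bin/sh
# Hypothetical wrapper: start_eval <pack> <pack-version> <input-set> <deployment>...
start_eval() {
    pack=$1
    pack_version=$2
    input_set=$3
    shift 3

    # This wrapper (not necessarily the CLI) insists on at least one deployment.
    if [ "$#" -eq 0 ]; then
        echo "start_eval: at least one deployment is required" >&2
        return 2
    fi

    # --deployment is repeatable: rewrite each remaining argument
    # into a "--deployment <value>" pair, preserving order.
    for dep in "$@"; do
        set -- "$@" --deployment "$dep"
        shift
    done

    agentclash eval start \
        --pack "$pack" \
        --pack-version "$pack_version" \
        --input-set "$input_set" \
        "$@"
}
```

For example, `start_eval my-pack 3 default dep-a dep-b` would expand to a single `eval start` call with two `--deployment` flags; the pack, version, input set, and deployment names here are made up.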
eval scorecard
Prints the latest candidate scorecard and, when configured, enriches it with baseline, comparison, and release gate envelopes in one JSON payload for CI (eval_test.go asserts those keys).
Omitting run_id uses deterministic “latest relevant run” semantics documented in tests—do not assume hidden state in automation; pass explicit ids in CI.
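In CI, that advice might translate into a small guard that refuses to run without an explicit id. This is a sketch under assumptions: `save_scorecard` is a hypothetical helper, and writing the payload to a file is just one possible way to hand it to later pipeline steps. The JSON key names are not listed on this page, so the sketch does not parse the payload.

```shell
#!/bin/sh
# Hypothetical helper: save_scorecard <run_id> <output_file>
save_scorecard() {
    run_id=$1
    out_file=$2

    # Refuse to fall back to "latest relevant run" semantics in automation.
    if [ -z "$run_id" ]; then
        echo "save_scorecard: an explicit run id is required" >&2
        return 2
    fi

    # Store the JSON payload for later steps to inspect.
    agentclash eval scorecard "$run_id" --json > "$out_file"
}
```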
Baseline bookmarks
The baseline set|show|clear commands manage a workspace-scoped pointer to a run (and, optionally, a specific run_agent). With a baseline set, eval scorecard can report diffs against it without you retyping ids.
doctor treats missing baseline as informational only—CI gates should not fail solely because no baseline exists yet.
Compare & release gates
compare runs hits comparison APIs with explicit baseline/candidate pair (optional agent ids).
compare gate posts to /v1/release-gates/evaluate with those ids; response includes policy_snapshot, evaluation_details, timestamps (see compare.go long help).
release-gate list surfaces historical evaluations with optional filters.
Gate outcomes use structured status codes (documented in compare.go help text) for scripting.
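A script can keep the gate response and the command's exit status separate so the caller decides what to do with each. The sketch below assumes nothing about what the status codes mean or whether they surface through the exit status, the response body, or both; consult the compare.go help text for that. `run_gate` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: run_gate <baseline_run> <candidate_run>
# Prints the gate response and returns the CLI's own exit status unchanged.
run_gate() {
    baseline_run=$1
    candidate_run=$2

    # Capture the response first so it can be logged even when the command fails.
    gate_response=$(agentclash compare gate \
        --baseline "$baseline_run" \
        --candidate "$candidate_run")
    gate_exit=$?

    printf '%s\n' "$gate_response"
    return "$gate_exit"
}
```

Passing the status through untouched leaves the interpretation of each code to the calling pipeline instead of hard-coding guesses in the wrapper.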
Relationship to run create
run create still exists for power users, but product messaging steers new usage to eval start (run.go long help cross-links). Pick one style per automation story to avoid divergent flag semantics.
Docs for consumers vs operators
This page is for people driving hosted or staging workspaces from the CLI. Self-host operators should still read Self-host quickstart for bringing up Postgres + Temporal + worker parity.