Eval workflows & gates

CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in cli/cmd.

Challenge packs are useless until a run binds a challenge_pack_version_id to one or more deployments. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.

Happy path commands

```bash
agentclash eval start --follow
agentclash baseline set [run_id] [--agent <label>]
agentclash eval scorecard [run_id] [--agent <label>] [--json]
agentclash compare runs --baseline <run> --candidate <run>
agentclash compare gate --baseline <run> --candidate <run>
agentclash release-gate list [--baseline ... --candidate ...]
```

Typical flow

1. Start a run. eval start --follow creates the run and streams events to your terminal until it finishes. It prints the run ID you'll reuse below.
2. Bookmark a baseline. baseline set <run_id> saves a workspace-scoped pointer to a known-good run. Do this once; it precedes step 4.
3. Run your candidate. Run eval start --follow again, against the new deployment or pack version.
4. Read the scorecard. eval scorecard <candidate_run_id> prints the candidate scorecard and, because a baseline is bookmarked, enriches it with the baseline, a comparison, and a release-gate verdict in one payload.
5. Gate in CI. compare gate --baseline <run> --candidate <run> exits nonzero on a regression, so a CI step can block a promotion.
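
Steps 4 and 5 of the flow can be sketched as a single CI helper. This is a minimal sketch, not part of the CLI: the function name gate_candidate and the convention of passing run IDs as positional arguments are assumptions made for illustration. It uses only the commands and flags listed above.

```bash
# gate_candidate <baseline_run_id> <candidate_run_id>
# Hypothetical CI helper: prints the candidate scorecard as JSON, then
# returns the exit code of `compare gate` (nonzero on a regression).
gate_candidate() {
  local baseline_run="$1" candidate_run="$2"

  # Step 4: print the scorecard for the CI log.
  # If the scorecard command itself fails, stop before gating.
  agentclash eval scorecard "$candidate_run" --json || return $?

  # Step 5: the gate's exit code becomes this function's exit status.
  agentclash compare gate --baseline "$baseline_run" --candidate "$candidate_run"
}
```

A CI step could call gate_candidate "$BASELINE_RUN" "$CANDIDATE_RUN" and block the promotion on any nonzero status, where both variables are placeholders for run IDs printed by earlier eval start runs.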

Source pointers for contributors: these commands live in cli/cmd/eval.go, baseline.go, compare.go, and release_gate.go.

`eval start`

Key flags (defined in cli/cmd/eval.go; resolution logic in eval_resolve.go):

Flag	Purpose
`--pack`	Pack id, slug, or exact name
`--pack-version`	Version id or integer
`--input-set`	Disambiguates when multiple input sets are published
`--deployment`	Repeatable; accepts id or exact name
`--follow`	Stream run events after creation
`--scope`	Regression scope: `full` or `suite_only`
`--suite` / `--case`	Target regression fixtures
`--peer-standings`	Peer standings injection (multi-agent)

In non-interactive environments, supply enough flags to disambiguate every choice; the resolver's error messages spell out what is missing.
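
For a CI job, a fully disambiguated invocation might look like the sketch below. The wrapper name start_eval and its argument order are hypothetical; only the flags come from the table above, and whether --input-set is needed depends on how many input sets the pack version publishes.

```bash
# start_eval <pack> <pack_version> <input_set> <deployment>...
# Hypothetical wrapper that passes every disambiguator explicitly, so the
# resolver never has to prompt in a non-interactive environment.
start_eval() {
  local pack="$1" pack_version="$2" input_set="$3"
  shift 3

  # Rewrite the remaining arguments (deployments) into repeated
  # --deployment flags without relying on bash arrays.
  local count=$#
  while [ "$count" -gt 0 ]; do
    set -- "$@" --deployment "$1"
    shift
    count=$((count - 1))
  done

  agentclash eval start \
    --pack "$pack" \
    --pack-version "$pack_version" \
    --input-set "$input_set" \
    "$@" \
    --follow
}
```

For example, start_eval my-pack 3 smoke dep-a dep-b (all placeholder values) expands to one eval start call with two --deployment flags.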

`eval scorecard`

Prints the candidate scorecard. When a baseline is bookmarked for the workspace, it also runs the comparison and release-gate evaluation and returns everything in one JSON payload for CI. With --json the payload has these top-level keys:

Key	Contents
`candidate`	The run/agent being scored (`run_id`, `run_status`, `run_agent_id`, …)
`scorecard`	The candidate's per-dimension scorecard
`baseline`	The bookmarked baseline pointer, or `null` if none is set
`comparison`	Baseline-vs-candidate deltas, or `null` if no baseline
`release_gate`	The gate verdict, or `null` if no baseline

If the scorecard is still computing (HTTP 202) or has errored (HTTP 409), the baseline, comparison, and release_gate keys stay null. In the 409 case the command also exits with code 1.

Omitting run_id selects the most recent run in the workspace by created_at (newest first). This is purely "latest run"—it does not skip the baseline run—so in automation pass an explicit run id rather than relying on ordering.

(Envelope assembly lives in buildEvalScorecardEnvelope in cli/cmd/eval.go; the latest-run rule is in resolveRunSummary in eval_resolve.go.)
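
Because the enrichment keys can be null, a CI script may want to check them before trusting the payload. The sketch below assumes jq is installed and inspects only the top-level keys documented above; the function name scorecard_is_enriched is hypothetical, and the inner shape of each key is deliberately not touched.

```bash
# scorecard_is_enriched
# Reads `eval scorecard --json` output on stdin. Succeeds (exit 0) only when
# baseline, comparison, and release_gate are all non-null.
scorecard_is_enriched() {
  # jq -e sets a nonzero exit status when the expression evaluates to false.
  jq -e '.baseline != null and .comparison != null and .release_gate != null' >/dev/null
}
```

One way to use it is agentclash eval scorecard "$RUN_ID" --json | scorecard_is_enriched, with "$RUN_ID" set to an explicit run id rather than relying on the latest-run default.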

Baseline bookmarks

baseline set, show, and clear manage a workspace-scoped pointer to a run (and, optionally, a specific run_agent). With a bookmark in place, eval scorecard can report baseline-vs-candidate differences without you retyping ids.

doctor treats a missing baseline as informational only; CI gates should not fail solely because no baseline exists yet.
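
A small wrapper can keep baseline bookmarking explicit in automation. This is a sketch under assumptions: the name bookmark_baseline, the rule that a run id is required, and the exit status 64 are choices made here, not CLI behavior. The CLI itself treats run_id as optional.

```bash
# bookmark_baseline <run_id> [agent_label]
# Hypothetical wrapper around `baseline set` that insists on an explicit
# run id and forwards --agent only when a label is given.
bookmark_baseline() {
  local run_id="${1:-}" agent_label="${2:-}"

  if [ -z "$run_id" ]; then
    echo "bookmark_baseline: run id required" >&2
    return 64  # usage error (a convention chosen for this sketch)
  fi

  if [ -n "$agent_label" ]; then
    agentclash baseline set "$run_id" --agent "$agent_label"
  else
    agentclash baseline set "$run_id"
  fi
}
```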

Compare & release gates

compare runs calls GET /v1/compare with an explicit baseline/candidate pair (optional agent ids) and prints per-dimension deltas.
compare gate sends POST /v1/release-gates/evaluate with those ids. With --json it prints the full server response (the verdict plus fields like policy_snapshot and evaluation_details); for the exact response shape, see the OpenAPI spec.
release-gate list surfaces historical evaluations with optional filters.

compare gate sets the process exit code so a CI step can branch on the outcome:

Exit code	Verdict	Meaning
`0`	`pass`	Gate passed; safe to promote.
`1`	`fail`	Regressions detected.
`2`	`warn`	Soft warning; review before promoting.
`3`	`insufficient_evidence`	Policy could not be evaluated; fix the spec or rerun.

(These exit codes are also printed in agentclash compare gate --help.)
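
The exit-code table above maps directly onto a case statement. The helper name gate_verdict is hypothetical, and treating any other code as unknown is an assumption of this sketch.

```bash
# gate_verdict <exit_code>
# Maps a `compare gate` exit code to the verdict name from the table above.
# Prints "unknown" and returns 1 for any code the table does not list.
gate_verdict() {
  case "$1" in
    0) echo pass ;;
    1) echo fail ;;
    2) echo warn ;;
    3) echo insufficient_evidence ;;
    *) echo unknown; return 1 ;;
  esac
}

# Example use in a CI step (run IDs are placeholders):
#   rc=0
#   agentclash compare gate --baseline "$BASELINE_RUN" --candidate "$CANDIDATE_RUN" || rc=$?
#   echo "gate verdict: $(gate_verdict "$rc")"
```

Capturing the code with || rc=$? keeps the step alive under set -e, so the script can decide separately how to treat warn and insufficient_evidence.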

Relationship to `run create`

run create still exists for power users, but product messaging steers new usage to eval start (the long help in run.go cross-links to it). Pick one style per automation story to avoid divergent flag semantics.

Docs for consumers vs operators

This page is for people driving hosted or staging workspaces from the CLI. Self-host operators should still read the Self-host quickstart for bringing up Postgres + Temporal + worker parity.
