Eval workflows & gates
CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in cli/cmd.
Challenge packs are useless until a run binds a challenge_pack_version_id to one or more deployments. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.
Happy path commands
    agentclash eval start --follow
    agentclash baseline set [run_id] [--agent <label>]
    agentclash eval scorecard [run_id] [--agent <label>] [--json]
    agentclash compare runs --baseline <run> --candidate <run>
    agentclash compare gate --baseline <run> --candidate <run>
    agentclash release-gate list [--baseline ... --candidate ...]

Typical flow
1. Start a run. eval start --follow creates the run and streams events to your terminal until it finishes. It prints the run ID you'll reuse below.
2. Bookmark a baseline. baseline set <run_id> saves a workspace-scoped pointer to a known-good run. Do this once; it precedes step 4.
3. Run your candidate. eval start --follow again, against the new deployment or pack version.
4. Read the scorecard. eval scorecard <candidate_run_id> prints the candidate scorecard and, because a baseline is bookmarked, enriches it with the baseline, a comparison, and a release-gate verdict in one payload.
5. Gate in CI. compare gate --baseline <run> --candidate <run> exits nonzero on a regression, so a CI step can block a promotion.
Source pointers for contributors: these commands live in cli/cmd/eval.go, baseline.go, compare.go, and release_gate.go.
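The CI step at the end of the flow can be sketched as a thin wrapper around compare gate. This is a minimal illustration, not part of the CLI: the function names and the injectable `run` parameter are hypothetical, and the only behaviour taken from this page is that the command exits nonzero when promotion should not proceed.

```python
import subprocess
from typing import Callable, List


def gate_command(baseline_run: str, candidate_run: str) -> List[str]:
    """Build the argv for `agentclash compare gate` (flags as listed on this page)."""
    return [
        "agentclash", "compare", "gate",
        "--baseline", baseline_run,
        "--candidate", candidate_run,
    ]


def promotion_allowed(
    baseline_run: str,
    candidate_run: str,
    run: Callable[[List[str]], int] = subprocess.call,  # hypothetical hook, swapped out in tests
) -> bool:
    """Return True only when the gate command exits 0."""
    return run(gate_command(baseline_run, candidate_run)) == 0
```

Injecting `run` keeps the decision logic testable without the CLI installed; in a real pipeline the default `subprocess.call` executes the command and returns its exit code.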
eval start
Key flags (defined in cli/cmd/eval.go; resolution logic in eval_resolve.go):
| Flag | Purpose |
|---|---|
| --pack | Pack id, slug, or exact name |
| --pack-version | Version id or integer |
| --input-set | Disambiguates when multiple sets published |
| --deployment | Repeatable; accepts id or exact name |
| --follow | Stream run events after creation |
| --scope | full vs suite_only regression scoping |
| --suite / --case | Target regression fixtures |
| --race-context | Peer standings injection (multi-agent) |
Non-interactive environments must supply enough disambiguators—resolver errors spell out what’s missing.
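For automation, one way to make sure every disambiguator is supplied is to build the eval start argument list explicitly. The sketch below uses only flags from the table above; the helper name is hypothetical, and it assumes each flag takes its value as a separate argument, which this page does not confirm.

```python
from typing import List, Optional, Sequence


def eval_start_args(
    pack: str,
    deployments: Sequence[str],
    pack_version: Optional[str] = None,
    input_set: Optional[str] = None,
    follow: bool = True,
) -> List[str]:
    """Assemble an `agentclash eval start` argv with explicit disambiguators."""
    args = ["agentclash", "eval", "start", "--pack", pack]
    if pack_version is not None:
        args += ["--pack-version", pack_version]
    if input_set is not None:
        args += ["--input-set", input_set]
    # --deployment is repeatable, so emit one flag per deployment.
    for deployment in deployments:
        args += ["--deployment", deployment]
    if follow:
        args.append("--follow")
    return args
```

Passing the version, input set, and deployments explicitly means a CI job does not depend on the resolver being able to pick a unique match.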
eval scorecard
Prints the candidate scorecard. When a baseline is bookmarked for the workspace, it also runs the comparison and release-gate evaluation and returns everything in one JSON payload for CI. With --json the payload has these top-level keys:
| Key | Contents |
|---|---|
| candidate | The run/agent being scored (run_id, run_status, run_agent_id, …) |
| scorecard | The candidate's per-dimension scorecard |
| baseline | The bookmarked baseline pointer, or null if none is set |
| comparison | Baseline-vs-candidate deltas, or null if no baseline |
| release_gate | The gate verdict, or null if no baseline |
If the scorecard is still computing (HTTP 202) or errored (HTTP 409), the baseline, comparison, and release_gate keys stay null. On a 409 the command also exits with code 1.
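A CI script consuming the --json payload needs to tolerate the null keys. The following sketch reads only the top-level keys and candidate.run_id named in the table above; the function name and the returned summary shape are hypothetical, and it deliberately does not look inside release_gate, since that object's fields are not described here.

```python
import json
from typing import Any, Dict

# Top-level keys listed in the table above.
ENVELOPE_KEYS = ("candidate", "scorecard", "baseline", "comparison", "release_gate")


def read_scorecard_envelope(raw: str) -> Dict[str, Any]:
    """Parse `eval scorecard --json` output and report which parts are populated."""
    payload = json.loads(raw)
    missing = [key for key in ENVELOPE_KEYS if key not in payload]
    if missing:
        raise ValueError(f"envelope is missing keys: {missing}")
    return {
        "run_id": payload["candidate"]["run_id"],
        # null when no baseline is bookmarked, or while the scorecard is not ready
        "has_baseline": payload["baseline"] is not None,
        "has_gate_verdict": payload["release_gate"] is not None,
    }
```

A caller can then skip gate handling when `has_gate_verdict` is false instead of failing on a missing object.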
Omitting run_id selects the most recent run in the workspace by created_at (newest first). This is purely "latest run"—it does not skip the baseline run—so in automation pass an explicit run id rather than relying on ordering.
(Envelope assembly lives in buildEvalScorecardEnvelope in cli/cmd/eval.go; the latest-run rule is in resolveRunSummary in eval_resolve.go.)
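The latest-run rule is easy to trip over, so here is an illustration of the described behaviour. It is not the code in resolveRunSummary; the run records, the `id` field name, and the ISO 8601 timestamp format are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List


def latest_run(runs: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the run with the newest created_at, with no special-casing of baselines."""
    return max(runs, key=lambda run: datetime.fromisoformat(run["created_at"]))


# Made-up sample: the baseline was re-run after the candidate finished.
runs = [
    {"id": "candidate", "created_at": "2024-05-01T10:00:00+00:00"},
    {"id": "baseline-rerun", "created_at": "2024-05-02T09:00:00+00:00"},
]
selected = latest_run(runs)
```

In this sample the selected run is the baseline re-run rather than the candidate, which is the situation an explicit run id avoids.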
Baseline bookmarks
The baseline set|show|clear commands manage a workspace-scoped pointer to a run (and, optionally, a specific run_agent). With a bookmark in place, eval scorecard can report baseline-vs-candidate diffs without retyping ids.
doctor treats missing baseline as informational only—CI gates should not fail solely because no baseline exists yet.
Compare & release gates
- compare runs calls GET /v1/compare with an explicit baseline/candidate pair (optional agent ids) and prints per-dimension deltas.
- compare gate posts to POST /v1/release-gates/evaluate with those ids. With --json it prints the full server response (verdict plus fields like policy_snapshot and evaluation_details); for the exact response shape see the OpenAPI spec.
- release-gate list surfaces historical evaluations with optional filters.
compare gate sets the process exit code so a CI step can branch on the outcome:
| Exit code | Verdict | Meaning |
|---|---|---|
| 0 | pass | Gate passed; safe to promote. |
| 1 | fail | Regressions detected. |
| 2 | warn | Soft warning; review before promoting. |
| 3 | insufficient_evidence | Policy could not be evaluated; fix the spec or rerun. |
(These exit codes are also printed in agentclash compare gate --help.)
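A CI step can translate the exit code back into a verdict and decide whether to block. The mapping below comes from the table above; the `block_on_warn` switch and the "unknown" fallback for any other exit code are hypothetical policy choices, not CLI behaviour.

```python
# Exit code -> verdict, as listed in the table above.
GATE_VERDICTS = {
    0: "pass",
    1: "fail",
    2: "warn",
    3: "insufficient_evidence",
}


def verdict_for_exit_code(code: int) -> str:
    """Map a `compare gate` exit code to its verdict name."""
    return GATE_VERDICTS.get(code, "unknown")


def should_block(code: int, block_on_warn: bool = False) -> bool:
    """Decide whether a CI step should stop the promotion."""
    verdict = verdict_for_exit_code(code)
    if verdict == "pass":
        return False
    if verdict == "warn":
        return block_on_warn  # team policy: treat soft warnings as blocking or not
    # fail, insufficient_evidence, and unrecognised codes all block
    return True
```

Blocking on unrecognised codes is the conservative default: a crash or a future verdict is never mistaken for a pass.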
Relationship to run create
run create still exists for power users, but product messaging steers new usage to eval start (the long help in run.go cross-links to it). Pick one style per automation story to avoid divergent flag semantics.
Docs for consumers vs operators
This page is for people driving hosted or staging workspaces from the CLI. Self-host operators should still read the Self-host quickstart for bringing up Postgres + Temporal + worker parity.