
Eval workflows & gates

CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in cli/cmd.

Challenge packs are useless until a run binds a challenge_pack_version_id to one or more deployments. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.

Happy path commands

```bash
agentclash eval start --follow
agentclash baseline set [run_id] [--agent <label>]
agentclash eval scorecard [run_id] [--agent <label>] [--json]
agentclash compare runs --baseline <run> --candidate <run>
agentclash compare gate --baseline <run> --candidate <run>
agentclash release-gate list [--baseline ... --candidate ...]
```

Typical flow

  1. Start a run. eval start --follow creates the run and streams events to your terminal until it finishes. It prints the run ID you'll reuse below.
  2. Bookmark a baseline. baseline set <run_id> saves a workspace-scoped pointer to a known-good run. Do this once; it precedes step 4.
  3. Run your candidate. eval start --follow again, against the new deployment or pack version.
  4. Read the scorecard. eval scorecard <candidate_run_id> prints the candidate scorecard and—because a baseline is bookmarked—enriches it with the baseline, a comparison, and a release-gate verdict in one payload.
  5. Gate in CI. compare gate --baseline <run> --candidate <run> exits nonzero on a regression, so a CI step can block a promotion.
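
The flow above can be wrapped in a pair of small shell helpers. This is a minimal sketch, not part of the product: the function names bookmark_baseline and check_candidate are hypothetical, and it assumes you already have both run IDs on hand (this page does not specify the format in which eval start prints them, so the sketch does not try to parse that output).

```bash
# Sketch only: function names are hypothetical; run IDs are supplied by you.

# Step 2: bookmark a known-good run as the workspace baseline (do this once).
bookmark_baseline() {
  agentclash baseline set "$1"
}

# Steps 4-5: print the candidate scorecard, then gate against the baseline.
# Returns 1 if the scorecard command fails; otherwise the function's exit
# status is whatever `compare gate` returned.
check_candidate() {
  baseline_run="$1"
  candidate_run="$2"

  agentclash eval scorecard "$candidate_run" --json || return 1
  agentclash compare gate --baseline "$baseline_run" --candidate "$candidate_run"
}
```

A CI job might call bookmark_baseline once after a known-good run, then call check_candidate for each new candidate and let its exit status decide whether the promotion step runs.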

Source pointers for contributors: these commands live in cli/cmd/eval.go, baseline.go, compare.go, and release_gate.go.

eval start

Key flags (defined in cli/cmd/eval.go; resolution logic in eval_resolve.go):

| Flag | Purpose |
| --- | --- |
| --pack | Pack id, slug, or exact name |
| --pack-version | Version id or integer |
| --input-set | Disambiguates when multiple sets published |
| --deployment | Repeatable; accepts id or exact name |
| --follow | Stream run events after creation |
| --scope | full vs suite_only regression scoping |
| --suite / --case | Target regression fixtures |
| --race-context | Peer standings injection (multi-agent) |

Non-interactive environments must supply enough disambiguators—resolver errors spell out what’s missing.
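
For a CI job, one way to avoid resolver errors is to pin every input explicitly. The sketch below is illustrative: the wrapper name start_pinned_eval and all argument values are hypothetical, and it assumes each flag takes its value as the following argument. It uses only flags listed above.

```bash
# Sketch: start a run with every resolver input pinned, so nothing is left
# for the resolver to guess. The wrapper name is hypothetical.
start_pinned_eval() {
  pack="$1"          # pack id, slug, or exact name
  pack_version="$2"  # version id or integer
  input_set="$3"     # only matters when multiple input sets are published
  deployment="$4"    # deployment id or exact name

  agentclash eval start \
    --pack "$pack" \
    --pack-version "$pack_version" \
    --input-set "$input_set" \
    --deployment "$deployment" \
    --follow
}

# Example call (all values are placeholders):
# start_pinned_eval my-pack 3 my-input-set staging-agent
```

If the resolver still reports something missing, its error message names the flag to add.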

eval scorecard

Prints the candidate scorecard. When a baseline is bookmarked for the workspace, it also runs the comparison and release-gate evaluation and returns everything in one JSON payload for CI. With --json the payload has these top-level keys:

| Key | Contents |
| --- | --- |
| candidate | The run/agent being scored (run_id, run_status, run_agent_id, …) |
| scorecard | The candidate's per-dimension scorecard |
| baseline | The bookmarked baseline pointer, or null if none is set |
| comparison | Baseline-vs-candidate deltas, or null if no baseline |
| release_gate | The gate verdict, or null if no baseline |

If the scorecard is still computing (HTTP 202) or has errored (HTTP 409), the baseline, comparison, and release_gate keys stay null. A 409 also makes the command exit with code 1.

Omitting run_id selects the most recent run in the workspace by created_at (newest first). This is purely "latest run"—it does not skip the baseline run—so in automation pass an explicit run id rather than relying on ordering.

(Envelope assembly lives in buildEvalScorecardEnvelope in cli/cmd/eval.go; the latest-run rule is in resolveRunSummary in eval_resolve.go.)
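
Because release_gate can be null (no baseline bookmarked, or the scorecard is not ready), a script that consumes the JSON payload may want to detect that case before treating the output as a verdict. The helper below is a rough sketch: gate_is_null is a hypothetical name, and it uses a plain-text match for brevity rather than a JSON parser, which would be more robust.

```bash
# Sketch: succeed when the payload on stdin has a null release_gate key.
# Relies on the key and its value sitting on the same line, which holds for
# both compact and pretty-printed JSON.
gate_is_null() {
  grep -Eq '"release_gate"[[:space:]]*:[[:space:]]*null'
}

# Usage (the run ID variable is a placeholder):
# if agentclash eval scorecard "$CANDIDATE_RUN" --json | gate_is_null; then
#   echo "no gate verdict in this payload"
# fi
```

Passing an explicit run ID here, rather than relying on the latest-run default, keeps the check tied to the candidate you meant to score.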

Baseline bookmarks

baseline set|show|clear manages a workspace-scoped pointer to a run (and, optionally, a specific run_agent). With a bookmark in place, eval scorecard can report baseline-vs-candidate differences without you retyping ids.

doctor treats missing baseline as informational only—CI gates should not fail solely because no baseline exists yet.

Compare & release gates

  • compare runs calls GET /v1/compare with an explicit baseline/candidate pair (optional agent ids) and prints per-dimension deltas.
  • compare gate sends those ids to POST /v1/release-gates/evaluate. With --json it prints the full server response (verdict plus fields like policy_snapshot and evaluation_details); for the exact response shape, see the OpenAPI spec.
  • release-gate list surfaces historical evaluations with optional filters.

compare gate sets the process exit code so a CI step can branch on the outcome:

| Exit code | Verdict | Meaning |
| --- | --- | --- |
| 0 | pass | Gate passed; safe to promote. |
| 1 | fail | Regressions detected. |
| 2 | warn | Soft warning; review before promoting. |
| 3 | insufficient_evidence | Policy could not be evaluated; fix the spec or rerun. |

(These exit codes are also printed in agentclash compare gate --help.)
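
A CI step can branch on these codes with a case statement. The sketch below maps each documented exit code to an action label; the function name gate_action and the labels (promote, block, review, rerun, error) are illustrative choices, not CLI output.

```bash
# Sketch: translate a `compare gate` exit code into a CI decision.
gate_action() {
  case "$1" in
    0) echo "promote" ;;  # pass
    1) echo "block" ;;    # fail: regressions detected
    2) echo "review" ;;   # warn: soft warning
    3) echo "rerun" ;;    # insufficient_evidence
    *) echo "error" ;;    # any other code, treated as unexpected
  esac
}

# Usage (run IDs are placeholders):
# agentclash compare gate --baseline "$BASELINE_RUN" --candidate "$CANDIDATE_RUN"
# action=$(gate_action "$?")
```

Treating unknown codes as a distinct error outcome, rather than folding them into fail, keeps tooling problems from being mistaken for regressions.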

Relationship to run create

run create still exists for power users, but product messaging steers new usage to eval start (the long help in run.go cross-links to it). Pick one style per automation pipeline so that flag semantics do not diverge.

Docs for consumers vs operators

This page is for people driving hosted or staging workspaces from the CLI. Self-host operators should still read Self-host quickstart for bringing up Postgres + Temporal + worker parity.

See also