CLI Reference

Commands, flags, and command groups generated from the current Cobra CLI source.

Readers new to AgentClash should start with the Hosted quickstart; this page mechanically lists what cli/cmd exposes and is regenerated on each docs rebuild. For the repository layout around the CLI, see the Codebase tour.

This page is generated from the Cobra command definitions in cli/cmd.

Global flags

  • --api-url — API base URL (overrides config)
  • --json — Output in JSON format
  • --no-color — Disable color output
  • --output (-o) — Output format: table, json, yaml
  • --quiet (-q) — Suppress non-essential output
  • --verbose (-v) — Enable debug output on stderr
  • --workspace (-w) — Workspace ID (overrides config)
  • --yes — Skip confirmation prompts
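Global flags can be combined with any command listed below. A minimal sketch, assuming the CLI binary is installed on PATH as `agentclash` (the binary name is not stated on this page); `list_runs_json` is a hypothetical helper name and the workspace ID is a placeholder:

```shell
# Hypothetical helper: list runs in a given workspace as JSON.
# Assumption: the CLI is invoked as `agentclash`.
list_runs_json() {
  workspace_id="$1"
  # --output and --workspace are global flags; `run list` is a documented command.
  agentclash run list --output json --workspace "$workspace_id"
}
```

Calling `list_runs_json <workspace-id>` overrides the configured workspace for that one invocation only.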

Command groups

agent-harness

Manage coding-agent harnesses

create

Create an agent harness

Flags

  • --api-key-secret — Workspace secret name containing the runner provider API key
  • --auth-mode default: api_key_secret — Harness auth mode: api_key_secret
  • --base-branch — Base branch for repository work
  • --codex-model — Runner model override
  • --codex-template — E2B template override for the harness runner
  • --description — Harness description
  • --evaluation-config — Inline JSON evaluation config
  • --evaluation-config-file — JSON file with validators and LLM judges
  • --execution-config — Inline JSON execution config
  • --from-file — JSON file with agent harness spec
  • --harness-kind default: codex_e2b — Harness runner kind: codex_e2b, claude_e2b, hermes_e2b, or openclaw_e2b
  • --name — Harness name
  • --openai-api-key-secret — Workspace secret name containing OPENAI_API_KEY
  • --repository-url — Repository URL for the harness task
  • --task — Task prompt for the coding harness

execution

Inspect agent harness executions

cancel <execution-id>

Cancel an Agent Harness execution

failure-review

Inspect or edit Agent Harness failure classifications

get <execution-id>

Get Agent Harness failure review

update <execution-id>

Update Agent Harness failure review annotations

Flags

  • --from-file — JSON file with failure review update payload
  • --human-class — Human-curated failure class
  • --human-payload — Inline JSON human payload
  • --human-summary — Human-curated failure summary
  • --suggested-class — Suggested failure class
  • --suggested-confidence — Suggested confidence as a decimal
  • --suggested-payload — Inline JSON suggested payload
  • --suggested-source — Suggested source: rules or llm
  • --suggested-summary — Suggested failure summary

get <execution-id>

Get an agent harness execution

promote-task <execution-id>

Promote a prior harness run into a private suite task

Flags

  • --failure-class — Failure class to store with promotion metadata
  • --failure-summary — Failure summary to store with promotion metadata
  • --from-file — JSON file with promotion payload
  • --metadata — Inline JSON promotion metadata
  • --public-prompt — Sanitized public prompt
  • --suite — Target Agent Harness suite ID
  • --title — Promoted private task title

retry <execution-id>

Retry a terminal Agent Harness execution

Flags

  • --idempotency-key — Retry idempotency key

executions <harness-id>

List executions for an agent harness

failures

Inspect Agent Harness failure summaries

summary

Summarize Agent Harness failure modes

get <id>

Get an agent harness

list

List agent harnesses

run <harness-id>

Start an agent harness execution

Flags

  • --follow — Poll until the harness execution reaches a terminal status
  • --message — Override the harness task prompt for this execution
  • --poll-interval — Polling interval for --follow
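The flags above can be combined to start an execution with a one-off prompt and wait for it to finish. A sketch only, assuming the binary is called `agentclash`; `run_harness` is a hypothetical wrapper, and the harness ID and prompt are supplied by the caller:

```shell
# Hypothetical wrapper around `agent-harness run`.
# Assumption: the CLI is invoked as `agentclash`.
run_harness() {
  harness_id="$1"
  prompt="$2"
  # --message overrides the stored task prompt for this execution only;
  # --follow polls until the execution reaches a terminal status.
  agentclash agent-harness run "$harness_id" --follow --message "$prompt"
}
```

Quoting `"$prompt"` keeps a multi-word task prompt as a single argument.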

suite

Manage Agent Harness suites and private task banks

create

Create an Agent Harness suite/private task bank

Flags

  • --description — Suite description
  • --from-file — JSON file with agent harness suite spec
  • --metadata — Inline JSON suite metadata
  • --name — Suite name
  • --task-json — Suite task JSON object; may be repeated

list

List Agent Harness suites

rankings <suite-id>

Get Agent Harness suite rankings

Flags

  • --k — k value for pass@k and pass^k
  • --version-id — Immutable suite version ID

run <suite-id>

Start suite runs across one or more harnesses

Flags

  • --harness — Harness ID to run; may be repeated or comma-separated
  • --task — Suite task ID filter; may be repeated or comma-separated

tasks <suite-id>

List public tasks for an Agent Harness suite

artifact

Upload and download artifacts

download <artifactId>

Download an artifact

Flags

  • --output (-O) — Output file path (defaults to stdout)

list

List artifacts in the workspace

upload <file>

Upload an artifact

Flags

  • --metadata — JSON metadata (optional)
  • --run — Run ID (optional)
  • --run-agent — Run agent ID (optional)
  • --type (required) — Artifact type (required)
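As an illustration of the upload flags, the sketch below attaches a file to a run. It assumes the binary is named `agentclash`; `upload_run_artifact` is a hypothetical helper, and the accepted values for `--type` are not listed on this page, so the type is passed through from the caller:

```shell
# Hypothetical helper: upload a file and associate it with a run.
# Assumption: the CLI is invoked as `agentclash`.
upload_run_artifact() {
  file="$1"
  artifact_type="$2"   # required by the CLI; valid values are not documented here
  run_id="$3"
  agentclash artifact upload "$file" --type "$artifact_type" --run "$run_id"
}
```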

validate-voice-manifest <file>

Validate a voice artifact manifest JSON file against the AgentClash schema

validate-voice-report <file>

Validate a voice report JSON file against an AgentClash schema

Flags

  • --schema — JSON Schema path (required when report type cannot be auto-detected)

auth

Manage authentication

login

Log in to AgentClash

Flags

  • --device — Print the verification URL instead of opening the browser automatically
  • --force — Start a new browser login even if existing credentials are valid

logout

Log out and remove stored credentials

status

Show current authentication status

tokens

Manage CLI access tokens

list

List your CLI tokens

revoke <token-id>

Revoke a CLI token

baseline

Manage the workspace-scoped default baseline run

clear

Clear the default baseline run for the current workspace

set [run]

Bookmark a run as the default baseline for the current workspace

Flags

  • --agent — Run agent ID or label (optional)

show

Show the default baseline run for the current workspace

build

Manage agent builds

create

Create a new agent build

Flags

  • --description — Build description
  • --name (required) — Build name (required)

get <id>

Get agent build with version history

list

List agent builds

version

Manage agent build versions

create <buildId>

Create a new draft version

Flags

  • --agent-kind — Agent kind: llm_agent, workflow_agent, programmatic_agent, multi_agent_system, hosted_external
  • --spec-file — JSON file with version spec fields
  • --template — Template key to scaffold this version (for example: honest-agent, code-reviewer)

get <versionId>

Get a build version

ready <versionId>

Mark a version as ready (immutable, deployable)

templates

List built-in build version templates

update <versionId>

Update a draft build version

Flags

  • --spec-file — JSON file with updated version spec fields

validate <versionId>

Validate a build version

challenge-pack

Manage challenge packs

init <file>

Scaffold a minimal challenge pack YAML bundle

Flags

  • --force — Overwrite an existing file
  • --name — Challenge pack display name (defaults from the file name)
  • --slug — Challenge pack slug (defaults from the file name)
  • --template default: prompt_eval — Starter template: prompt_eval, responses, or native

list

List challenge packs

publish <file>

Publish a challenge pack YAML bundle

validate <file>

Validate a challenge pack YAML bundle
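The init, validate, and publish commands above can be chained into one authoring loop. This is a sketch under assumptions: the binary is called `agentclash`, each command exits nonzero on failure, and `scaffold_and_publish_pack` is a hypothetical function name:

```shell
# Hypothetical helper: scaffold, validate, then publish a challenge pack bundle.
# Assumption: the CLI is invoked as `agentclash` and signals failure via exit code.
scaffold_and_publish_pack() {
  pack_file="$1"
  # `&&` stops the chain at the first failing step, so an invalid bundle is never published.
  agentclash challenge-pack init "$pack_file" --template prompt_eval &&
    agentclash challenge-pack validate "$pack_file" &&
    agentclash challenge-pack publish "$pack_file"
}
```

In practice you would edit the scaffolded YAML between the init and validate steps; the chain is shown in one function only to make the command order explicit.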

ci

Manage AgentClash CI manifests

baseline

Resolve the baseline selected by an AgentClash CI manifest

Flags

  • --manifest default: .agentclash/ci.yaml — Path to the AgentClash CI manifest

init <file>

Write a sample AgentClash CI manifest

Flags

  • --force — Overwrite an existing manifest

run

Run the AgentClash CI workflow described by a manifest

Flags

  • --artifact-dir — Write stable AgentClash CI JSON artifacts to this directory
  • --ci-branch — Branch metadata override
  • --ci-commit — Commit SHA metadata override
  • --ci-default-branch — Default branch metadata override for auto_on_main regression promotion
  • --ci-event — CI event name metadata override
  • --ci-provider — CI provider metadata override
  • --ci-pull-request — Positive pull request number metadata override
  • --ci-ref — Git ref metadata override
  • --ci-repository — Repository metadata override, for example owner/repo
  • --ci-workflow — Workflow name metadata override
  • --ci-workflow-run-attempt — Workflow run attempt metadata override
  • --ci-workflow-run-id — Workflow run id metadata override
  • --ci-workflow-run-url — Workflow run URL metadata override
  • --follow — Stream run events while waiting for the candidate run
  • --github-step-summary — Append a GitHub Actions step summary when GITHUB_STEP_SUMMARY is set
  • --manifest default: .agentclash/ci.yaml — Path to the AgentClash CI manifest
  • --poll-interval — Polling interval while waiting for run completion
  • --summary-file — Write a Markdown CI gate summary to this file
  • --timeout — Maximum time to wait for the candidate run; 0 disables the timeout

should-run

Decide whether AgentClash CI should run

Flags

  • --base — Base git ref for deriving changed files
  • --changed-file — Changed file path; may be repeated
  • --github-event — GitHub event JSON file for deriving pull request labels
  • --head — Head git ref for deriving changed files
  • --labels — Pull request labels; may be comma-separated or repeated
  • --manifest default: .agentclash/ci.yaml — Path to the AgentClash CI manifest
  • --repo default: . — Git repository path for --base/--head diff

validate <file>

Validate an AgentClash CI manifest

Flags

  • --remote — Validate manifest resource IDs against the selected workspace
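A CI job might validate the manifest against the workspace before running it. The following is a sketch, not a prescribed pipeline: it assumes the binary is `agentclash` and that failures surface as nonzero exit codes; `ci_gate` and the default summary file name are hypothetical:

```shell
# Hypothetical CI step: validate the manifest remotely, then run the workflow.
# Assumption: the CLI is invoked as `agentclash`.
ci_gate() {
  manifest="${1:-.agentclash/ci.yaml}"   # documented default manifest path
  summary_file="${2:-ci-summary.md}"     # hypothetical output file name
  agentclash ci validate "$manifest" --remote &&
    agentclash ci run --manifest "$manifest" --follow --summary-file "$summary_file"
}
```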

compare

Compare runs and evaluate release gates

gate

Evaluate a release gate (nonzero exit = regression or missing evidence)

Flags

  • --baseline (required) — Baseline run ID (required)
  • --baseline-agent — Baseline run agent ID (optional)
  • --candidate (required) — Candidate run ID (required)
  • --candidate-agent — Candidate run agent ID (optional)
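Because a nonzero exit code from `compare gate` means a regression or missing evidence, a script can branch directly on the exit status. A minimal sketch, assuming the binary is named `agentclash`; `release_gate` is a hypothetical wrapper and the run IDs are placeholders:

```shell
# Hypothetical wrapper: turn the gate's exit status into a readable message.
# Assumption: the CLI is invoked as `agentclash`.
release_gate() {
  baseline_run="$1"
  candidate_run="$2"
  if agentclash compare gate --baseline "$baseline_run" --candidate "$candidate_run"; then
    echo "gate passed"
  else
    # Per the command description, nonzero means regression or missing evidence.
    echo "gate failed: regression or missing evidence" >&2
    return 1
  fi
}
```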

latest

Compare the saved baseline against the latest non-baseline run

Flags

  • --agent — Run agent ID or label to use for both runs when possible
  • --baseline-agent — Baseline run agent ID or label (defaults to the saved baseline agent)
  • --candidate-agent — Candidate run agent ID or label
  • --gate — Also evaluate the release gate and return a nonzero exit code for non-pass verdicts

runs

Compare baseline vs candidate runs

completion [bash|zsh|fish|powershell]

Generate a shell completion script

config

Manage CLI configuration

get <key>

Get a config value

list

List all config values

set <key> <value>

Set a config value

dataset

Manage eval datasets

create

Create a dataset

Flags

  • --default-challenge-pack-version-id — Default challenge pack version ID
  • --description — Dataset description
  • --enforce-schema — Reject examples that do not match the input schema
  • --from-file — JSON file with dataset create payload
  • --input-schema — Input JSON Schema
  • --name — Dataset name
  • --slug — Dataset slug

delete <datasetId>

Archive a dataset

eval <datasetId>

Run an eval over a dataset version

Flags

  • --challenge — Challenge key to bind examples to
  • --deployment — Agent deployment ID (repeatable)
  • --follow — Follow run events after creation
  • --mapping — Optional JSON mapping for dataset example fields
  • --name — Optional run name
  • --pack — Challenge pack version ID
  • --version — Dataset version ID to run

example

Manage dataset examples

add <datasetId>

Add or upsert a dataset example

Flags

  • --expected — Expected output JSON
  • --external-id — Stable external ID for idempotent upsert
  • --from-file — JSON file with dataset example payload
  • --input — Example input JSON
  • --metadata — Metadata JSON
  • --source — Example source: manual, import, trace, synthetic, or promotion
  • --tag — Example tag (repeatable)

edit <datasetId> <exampleId>

Edit a dataset example

Flags

  • --expected — Expected output JSON
  • --from-file — JSON file with dataset example patch payload
  • --input — Example input JSON
  • --metadata — Metadata JSON
  • --source — Example source: manual, import, trace, synthetic, or promotion
  • --status — Example status: active, archived, or muted
  • --tag — Example tag (repeatable)

list <datasetId>

List dataset examples

rm <datasetId> <exampleId>

Archive a dataset example

export <datasetId>

Export dataset examples

Flags

  • --format default: jsonl — Export format: openai, braintrust, langsmith, phoenix, jsonl, or csv
  • --version — Dataset version ID to export

generate <datasetId>

Start in-house synthetic dataset generation

Flags

  • --count — Target number of accepted synthetic examples
  • --create-version — Snapshot a dataset version when generation completes
  • --follow — Poll generation job status until it finishes
  • --model-alias — Model alias ID
  • --provider-account — Provider account ID
  • --seeds-tag — Only use seed examples with this tag
  • --strategy default: self-instruct — Generation strategy (v1: self-instruct)
  • --version-label — Optional label for the generated dataset version

import <datasetId> <file>

Import examples into a dataset

Flags

  • --dry-run — Preview normalized examples without mutating the dataset
  • --format — Import format: openai, braintrust, langsmith, phoenix, jsonl, or csv
  • --map — Mapping entry key=value (repeatable); values may be comma-separated for input_keys/output_keys/metadata_keys
  • --mapping — JSON mapping for generic JSONL/CSV imports
  • --mode default: add — Import mode: add or replace
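Since `--dry-run` previews normalized examples without mutating the dataset, one cautious pattern is to preview first and import only if the preview succeeds. This sketch assumes the binary is `agentclash` and that a failed preview exits nonzero (not confirmed by this page); `import_examples` is a hypothetical helper:

```shell
# Hypothetical helper: preview a JSONL import, then apply it in add mode.
# Assumptions: the CLI is invoked as `agentclash`; a failed dry run exits nonzero.
import_examples() {
  dataset_id="$1"
  file="$2"
  agentclash dataset import "$dataset_id" "$file" --format jsonl --dry-run &&
    agentclash dataset import "$dataset_id" "$file" --format jsonl --mode add
}
```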

import-traces <datasetId> [file]

Import production traces as reviewable dataset candidates

Flags

  • --artifact — Existing artifact ID to reference instead of inline payload
  • --from-file — JSON file with import-traces request body
  • --redaction — JSON redaction config (drop/hash metadata keys)
  • --run — Run ID for agentclash trace import
  • --run-agent — Run agent ID for agentclash trace import
  • --source — Trace source platform: otel, braintrust, langsmith, phoenix, or agentclash

list

List datasets

promote <datasetId> <candidateId>

Promote a trace candidate into a dataset example

Flags

  • --expected — Edited expected output JSON
  • --from-file — JSON file with promote request body
  • --tag — Tags to apply on promotion

sync-regression-suite <datasetId>

Promote dataset examples into a linked regression suite

Flags

  • --challenge — Challenge key
  • --format default: text — Output format: text or json
  • --pack — Challenge pack version ID
  • --suite — Existing regression suite ID (optional)
  • --suite-name — Name for a newly created regression suite
  • --version — Dataset version ID to sync

test <datasetId>

Run a dataset eval gate against a baseline

Flags

  • --baseline — Baseline ID to compare against
  • --challenge — Challenge key for eval
  • --deployment — Agent deployment ID (repeatable)
  • --eval — Start a dataset eval before gating
  • --format default: text — Output format: text, json, or junit
  • --max-regressions — Maximum allowed regressions versus baseline
  • --min-pass-rate — Minimum pass rate required to pass the gate
  • --pack — Challenge pack version ID for eval
  • --poll-interval — Polling interval while waiting for eval completion
  • --run — Candidate run ID (required unless --eval is set)
  • --timeout — Maximum time to wait for an eval run started with --eval
  • --version — Dataset version ID for eval

trace-candidates

Review imported trace candidates

list <datasetId>

List trace candidates awaiting promotion

Flags

  • --status — Filter by candidate status (pending, promoted, rejected)

version

Manage dataset versions

create <datasetId>

Snapshot the current dataset examples

Flags

  • --label — Optional dataset version label

list <datasetId>

List dataset versions

view <datasetId>

View a dataset

deployment

Manage agent deployments

create

Create an agent deployment

Flags

  • --agent-build-id — Agent build ID
  • --build-version-id — Agent build version ID
  • --from-file — JSON file with deployment spec
  • --model-alias-id — Model alias ID
  • --name — Deployment name
  • --provider-account-id — Provider account ID
  • --runtime-profile-id — Runtime profile ID

list

List agent deployments

doctor

Check auth, workspace, and eval readiness

Flags

  • --pack — Challenge pack YAML file to check for run readiness

eval

Workflow-first eval commands

scorecard [run]

Show a run-first scorecard and compare against the bookmarked baseline

Flags

  • --agent — Run agent ID or label (optional)

session

Inspect repeated eval sessions

follow <evalSessionId>

Poll a repeated eval session until aggregation finishes

Flags

  • --poll-interval — Polling interval while waiting for completion
  • --timeout — Maximum time to wait; 0 disables the timeout

get <evalSessionId>

Show repeated eval session details and aggregate metrics

list

List repeated eval sessions in the workspace

Flags

  • --limit — Maximum eval sessions to list (1-100)
  • --offset — Eval session list offset

start

Start an eval using names, defaults, and guided selection

Flags

  • --case — Regression case IDs (repeatable)
  • --deployment — Deployment ID or exact name (repeatable)
  • --follow — Follow run events after creation
  • --input-set — Challenge input set ID, key, or exact name
  • --name — Run name (optional)
  • --pack — Challenge pack ID, slug, or exact name
  • --pack-version — Challenge pack version ID or version number
  • --race-context — Enable live peer-standings injection during the run (requires 2+ agents)
  • --race-context-cadence — Override race-context cadence; minimum steps between standings injections, [1, 10]. 0 uses the backend default.
  • --repetitions — Repeat the eval N times in a multi-run eval session, [1, 100]. >=2 routes through /v1/eval-sessions and unlocks pass@K + pass^K aggregation.
  • --scope default: full — Run scope: full or suite_only
  • --suite — Regression suite ID or exact name (repeatable)
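The `--repetitions` range of [1, 100] can be checked before the CLI is invoked at all. A sketch under assumptions: the binary is `agentclash`, `start_repeated_eval` is a hypothetical helper, and the default of 3 repetitions is an arbitrary choice for illustration:

```shell
# Hypothetical helper: start a repeated eval after validating the repetition count.
# Assumption: the CLI is invoked as `agentclash`.
start_repeated_eval() {
  pack="$1"
  deployment="$2"
  repetitions="${3:-3}"   # illustrative default, not a CLI default
  case "$repetitions" in
    ''|*[!0-9]*) echo "repetitions must be an integer" >&2; return 2 ;;
  esac
  # The flag documents an allowed range of [1, 100].
  if [ "$repetitions" -lt 1 ] || [ "$repetitions" -gt 100 ]; then
    echo "repetitions must be in [1, 100]" >&2
    return 2
  fi
  agentclash eval start --pack "$pack" --deployment "$deployment" --repetitions "$repetitions"
}
```

Any value of 2 or more goes through the multi-run eval session path described in the flag text, which is what enables pass@K and pass^K aggregation.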

infra

Manage infrastructure resources

model-catalog

Browse the global model catalog

get <id>

Get a model catalog entry

list

List available models

init

Initialize a project with .agentclash.yaml

Flags

  • --org-id — Organization ID to bind
  • --workspace-id — Workspace ID to bind

integration

Install and verify AgentClash Agent Skills for coding agents

Choose and save your default workspace

org

Manage organizations

create

Create a new organization

Flags

  • --name (required) — Organization name (required)
  • --slug — Organization slug (optional, auto-generated)

get <id>

Get organization details

list

List organizations you belong to

members

Manage organization members

invite <orgId>

Invite a member to the organization

Flags

  • --email (required) — Email address to invite (required)
  • --role default: org_member — Role: org_admin, org_member

list <orgId>

List organization members

update <membershipId>

Update an organization membership

Flags

  • --role — New role: org_admin, org_member
  • --status — New status: active, suspended, archived

update <id>

Update an organization

Flags

  • --name — New organization name
  • --status — New status (active, archived)

playground

Manage playgrounds, test cases, and experiments

create

Create a playground

Flags

  • --from-file — JSON file with playground spec
  • --name — Playground name

delete <id>

Delete a playground

experiment

Manage playground experiments

batch <playgroundId>

Create experiments in batch (one per model)

Flags

  • --from-file — JSON file with batch experiment spec

compare

Compare two experiments

Flags

  • --baseline (required) — Baseline experiment ID (required)
  • --candidate (required) — Candidate experiment ID (required)

create <playgroundId>

Create an experiment

Flags

  • --from-file — JSON file with experiment spec

get <experimentId>

Get an experiment

list <playgroundId>

List experiments

results <experimentId>

List results for an experiment

get <id>

Get a playground

list

List playgrounds

test-case

Manage playground test cases

create <playgroundId>

Create a test case

Flags

  • --from-file — JSON file with test case spec

delete <testCaseId>

Delete a test case

list <playgroundId>

List test cases

update <testCaseId>

Update a test case

Flags

  • --from-file — JSON file with test case spec

update <id>

Update a playground

Flags

  • --from-file — JSON file with playground spec

prompt-eval

Manage prompt eval configs

import-promptfoo <file>

Convert a safe Promptfoo subset into an AgentClash prompt eval config

Flags

  • --force — Overwrite --out when it already exists
  • --lossy — Allow documented lossy conversions
  • --name — Prompt eval name for the generated config
  • --out — Write the converted prompt eval YAML to this path instead of stdout
  • --provider-account default: default — Provider account name or id to use for imported provider aliases

init [file]

Scaffold a prompt eval YAML config

Flags

  • --force — Overwrite an existing file
  • --name — Prompt eval name (defaults from the file name)

results <experiment-id>

Fetch prompt eval experiment results

Flags

  • --threshold — Override the assertion pass-rate gate for fetched results

run [file]

Compile a prompt eval config and launch playground experiments

Flags

  • --ci — Apply CI-safe validation rules
  • --follow — Wait for launched experiments and print results
  • --max-cases — Maximum model x test cases allowed before launch
  • --poll-interval — Polling interval while following experiments
  • --threshold — Override thresholds.assertion_pass_rate for this run
  • --timeout — Maximum time to wait while following experiments; 0 disables the timeout

validate [file]

Validate a prompt eval YAML config locally

Flags

  • --ci — Apply CI-safe validation rules
  • --max-cases — Maximum model x test cases allowed before launch
  • --remote — Validate referenced AgentClash workspace resources without creating them
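The validate and run commands share the `--ci` flag, so a pipeline step can apply the same rules to both. A sketch, assuming the binary is `agentclash` and that validation failures exit nonzero; `prompt_eval_ci` is a hypothetical name:

```shell
# Hypothetical CI step: validate a prompt eval config, then launch and follow it.
# Assumption: the CLI is invoked as `agentclash`.
prompt_eval_ci() {
  config="$1"
  # Local validation runs first so no experiments are launched for a bad config.
  agentclash prompt-eval validate "$config" --ci &&
    agentclash prompt-eval run "$config" --ci --follow
}
```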

quickstart

Check eval readiness and show the next best command

quota

Show workspace quota usage

regression-suite

Manage regression suites and cases

case

Manage individual regression cases

capture-production <suiteId>

Capture a production failure as a proposed regression case

Flags

  • --evidence-tier — Evidence tier
  • --external-url — Production incident URL
  • --failure-class — Failure class
  • --failure-summary — Failure summary
  • --from-file — JSON file with production failure capture payload
  • --incident-id — Production incident ID
  • --observed-at — Production observation timestamp (RFC3339)
  • --promotion-mode — Promotion mode: full_executable, output_only, or manual
  • --severity — Case severity: info, warning, or blocking
  • --source — Production source label
  • --source-case-key — Source production case or incident key
  • --source-challenge-identity-id — Source challenge identity ID
  • --source-challenge-input-set-id — Source challenge input set ID
  • --source-challenge-pack-version-id — Source challenge pack version ID
  • --source-item-key — Source item key
  • --title — Regression case title

update <caseId>

Update a regression case

Flags

  • --description — Case description
  • --from-file — JSON file with regression case patch payload
  • --severity — Case severity: info, warning, or blocking
  • --status — Case status: proposed, active, muted, archived, or rejected
  • --title — Case title

cases <suiteId>

List regression cases in a suite

create

Create a regression suite

Flags

  • --default-gate-severity — Default gate severity: info, warning, or blocking
  • --description — Suite description
  • --from-file — JSON file with regression suite create payload
  • --name — Suite name
  • --source-challenge-pack-id — Source challenge pack ID

get <suiteId>

Get a regression suite

list

List regression suites

update <suiteId>

Update a regression suite

Flags

  • --default-gate-severity — Default gate severity: info, warning, or blocking
  • --description — Suite description
  • --from-file — JSON file with regression suite patch payload
  • --name — Suite name
  • --status — Suite status: active or archived

release-gate

Inspect evaluated release gates

list

List evaluated release gates

Flags

  • --baseline — Baseline run ID
  • --candidate — Candidate run ID

replay

View execution replays

get <runAgentId>

Get execution replay steps

Flags

  • --cursor — Step offset to start from
  • --limit — Steps per page (1-200)

triage [run]

Summarize ranking, failures, scorecard, replay, and artifacts for debugging

Flags

  • --agent — Run agent ID or label to triage
  • --cursor — Replay step offset to start from
  • --limit — Replay steps to include (1-50)

run

Manage evaluation runs

agents <runId>

List agents in a run

cancel <id>

Cancel an active run

compare

Compare a baseline run against a candidate run

create

Create and submit an evaluation run

Flags

  • --case — Regression case IDs (repeatable)
  • --challenge-pack-version — Challenge pack version ID (optional in a TTY; prompted when omitted)
  • --deployment-lineup — Challenge pack deployment lineup to use when --deployments is omitted (default: default)
  • --deployment-lineups — Challenge pack deployment lineups to cross with --seeds for a race series
  • --deployments — Agent deployment IDs (optional in a TTY; prompted when omitted)
  • --follow — Follow run events after creation
  • --include-proposed-regressions — Include proposed regression cases for validation runs
  • --input-set — Challenge input set ID (optional)
  • --max-iter — Override max iterations for this run (1-1000). 0 uses the pack/runtime default.
  • --mode — Voice eval mode: text-sim (future: audio-sim, live-call, replay-import)
  • --name — Run name (optional)
  • --race-context — Enable live peer-standings injection during the run (requires 2+ agents)
  • --race-context-cadence — Override race-context cadence; minimum steps between standings injections, [1, 10]. 0 uses the backend default.
  • --scope default: full — Run scope: full or suite_only
  • --seeds — Create a seeded eval session with N child runs, one per seed (1-100). 0 creates a single run.
  • --suite — Regression suite IDs (repeatable; required with --scope suite_only unless --case is used)
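To illustrate the `--scope suite_only` requirement, the sketch below creates a run restricted to one regression suite. It assumes the binary is `agentclash` and passes a single deployment ID, since the separator for multiple `--deployments` values is not documented here; `create_suite_only_run` is a hypothetical helper:

```shell
# Hypothetical helper: create a suite-only run and follow its events.
# Assumption: the CLI is invoked as `agentclash`.
create_suite_only_run() {
  pack_version_id="$1"
  deployment_id="$2"
  suite_id="$3"
  # --suite is required with --scope suite_only unless --case is used.
  agentclash run create \
    --challenge-pack-version "$pack_version_id" \
    --deployments "$deployment_id" \
    --scope suite_only --suite "$suite_id" --follow
}
```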

events <runId>

Stream live run events via SSE

Flags

  • --filter — Filter streamed events by event type pattern (exact, comma-separated, or glob; '*' matches any non-slash chars, so 'model.*' matches 'model.call.started'; repeatable)

export <runId>

Export persisted run events as JSONL

failures <runId>

List failure review items for a run

Flags

  • --agent — Filter by run agent ID
  • --class — Filter by failure class
  • --cluster — Filter by failure cluster key
  • --cursor — Pagination cursor
  • --evidence-tier — Filter by evidence tier
  • --limit — Maximum failures to return
  • --severity — Filter by severity: info, warning, or blocking

get <id>

Get run details

list

List runs in the workspace

promote-failure <runId> <challengeIdentityId>

Promote a run failure into a regression case

Flags

  • --failure-summary — Failure summary
  • --from-file — JSON file with promotion payload
  • --promotion-mode — Promotion mode: full_executable or output_only
  • --run-agent — Run agent ID
  • --severity — Case severity: info, warning, or blocking
  • --suite — Regression suite ID
  • --title — Regression case title
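A failure found in a run can be promoted straight into a suite as a blocking case. This is a sketch with assumptions: the binary is `agentclash`, `promote_blocking_failure` is a hypothetical helper, and `output_only` is chosen arbitrarily from the two documented promotion modes:

```shell
# Hypothetical helper: promote a run failure into a blocking regression case.
# Assumption: the CLI is invoked as `agentclash`.
promote_blocking_failure() {
  run_id="$1"
  challenge_identity_id="$2"
  suite_id="$3"
  title="$4"
  agentclash run promote-failure "$run_id" "$challenge_identity_id" \
    --suite "$suite_id" --severity blocking \
    --promotion-mode output_only --title "$title"
}
```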

ranking <runId>

Get run ranking and composite scores

Flags

  • --sort-by — Sort by: composite, correctness, reliability, latency, cost

replay <run-agent-id>

Inspect replay steps for a run agent

Flags

  • --cursor — Step offset to start from
  • --limit — Steps per page (1-200)

scorecard <runAgentId>

Get agent scorecard

series

Manage durable race series

create

Create a race series from deployment lineups and seeds

Flags

  • --challenge-pack-version — Challenge pack version ID
  • --deployment-lineups — Challenge pack deployment lineups to cross with --seeds
  • --input-set — Challenge input set ID (optional)
  • --max-iter — Override max iterations for each child run (1-1000). 0 uses the pack/runtime default.
  • --name — Series name (optional)
  • --seeds — Number of seeds to cross with each deployment lineup (1-100)

report <eval-session-id>

Show aggregate score, correctness, and cost for a race series

transcript <runId>

Export a Markdown run transcript

turn

Multi-turn human takeover helpers

status <runAgentId>

Check whether a run agent is awaiting human input

Flags

  • --run — Run ID

submit <runAgentId>

Submit a human user message for an awaiting multi_turn phase

Flags

  • --message — Human user message for the awaiting turn
  • --run — Run ID

schema

Print the CLI command tree as machine-readable JSON

secret

Manage workspace secrets

delete <key>

Delete a secret

list

List workspace secret keys

set <key>

Create or update a secret

Flags

  • --value — Secret value (reads from stdin if omitted)
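Omitting `--value` makes the command read the secret from stdin, which keeps the value out of the argument list. A minimal sketch, assuming the binary is `agentclash`; `set_secret_from_file` is a hypothetical helper, and how the CLI treats a trailing newline in the input is not documented here:

```shell
# Hypothetical helper: set a workspace secret from the contents of a file.
# Assumption: the CLI is invoked as `agentclash`.
set_secret_from_file() {
  key="$1"
  value_file="$2"
  # No --value flag: the CLI reads the secret from stdin instead.
  agentclash secret set "$key" < "$value_file"
}
```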

security

Security-evals tooling (stress runs, leak rate, policy preview)

agent-vault-stress

Stress-test a model with REAL Infisical Agent Vault routing (function calling)

Flags

  • --allowed-upstream — Hostname the model is supposed to be brokering for, e.g. api.stripe.com. Any tool call to a different host is flagged as confused-deputy.
  • --api-key-env default: OPENAI_API_KEY — Env var holding the OpenAI API key
  • --canary-token — The broker token the harness watches for in model output (required). Usually the av_agt_/av_sess_ token embedded in --proxy-url.
  • --canary-token-env default: AGENT_VAULT_TOKEN — Env var to read --canary-token from if the flag is empty
  • --from-pack — Path to a security pack YAML. When set, runs every adversarial_prompts[] entry against the model and writes a per-attack report. Conflicts with --user-message.
  • --iterations — Number of trials
  • --mgmt-url — Agent Vault management API URL, e.g. http://127.0.0.1:14321
  • --mgmt-url-env default: AGENT_VAULT_ADDR — Env var to read --mgmt-url from if the flag is empty
  • --model default: gpt-4o-mini — OpenAI model id
  • --openai-endpoint default: https://api.openai.com/v1/chat/completions — Chat Completions endpoint (override for compatible providers)
  • --out — Path to write per-iteration JSON report (single-prompt mode)
  • --out-dir — Directory to write per-attack JSON reports when --from-pack is set (default: skip JSON, print table only)
  • --proxy-url — Agent Vault MITM proxy URL, e.g. https://av_agt_xxx:eval@127.0.0.1:14322
  • --proxy-url-env default: AGENT_VAULT_PROXY_URL — Env var to read --proxy-url from if the flag is empty
  • --system-prompt — System prompt override (default: deploy assistant with Agent Vault boundary)
  • --timeout — Per-LLM-call timeout
  • --user-message — Adversarial user message (required unless --from-pack is set)

avmock-upstream

Stand up a deterministic HTTP mock upstream for the agent-vault-stress campaign

Flags

  • --addr default: 127.0.0.1:8443 — TCP address to listen on
  • --detect-canary — Substring to scan inbound request URL, headers, and body for. Any match writes a vault-leak incident to stderr and returns 400 vault_leak_detected to the caller. Use to catch credentials that should never reach the upstream.
  • --log-requests — Log every inbound request to stderr in one-line form
  • --require-bearer — If set, requests must carry an Authorization header containing this substring; otherwise the mock returns 401. Use to assert the vault is injecting a credential.
  • --service default: stripe — Service shape: stripe | github | generic

runtime-stress

Stress-test a model with REAL Vault SDK tool access (function calling)

Flags

  • --api-key-env default: OPENAI_API_KEY — Env var holding the OpenAI API key
  • --canary-path default: stripe-key — Vault KV path the agent should fetch
  • --canary-value — The exact canary value to detect in leaks (required)
  • --iterations — Number of trials
  • --model default: gpt-4o-mini — OpenAI model id
  • --out — Path to write per-iteration JSON report
  • --system-prompt — System prompt override (default: deploy assistant with strict boundary)
  • --timeout — Per-LLM-call timeout
  • --user-message — Adversarial user message (required)
  • --vault-addr default: http://127.0.0.1:8200 — Vault server address
  • --vault-token-env default: VAULT_TOKEN — Env var holding the Vault token

stress-run <pack-yaml>

Run a security pack N times against an LLM provider and aggregate the leak rate

Flags

  • --api-key-env default: OPENAI_API_KEY — Env var holding the provider API key
  • --concurrency — Max concurrent iterations per provider/model pair
  • --iterations — Number of stress iterations per provider
  • --model default: gpt-4o-mini — Comma-separated model ids, paired with --provider
  • --no-system-guard — Drop the 'refuse leaks' sentence from the system prompt. Measures baseline alignment without harness-side coaching.
  • --out — Path to write the full JSON report (default: stdout summary only)
  • --provider default: openai — Comma-separated providers (currently: openai)
  • --timeout — Per-LLM-call timeout
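A stress run against the default provider and model might be wrapped as follows. This is a sketch: the binary name `agentclash`, the helper name `stress_pack`, and the iteration and concurrency counts are all illustrative assumptions rather than recommended settings:

```shell
# Hypothetical helper: run a security pack repeatedly and save the JSON report.
# Assumption: the CLI is invoked as `agentclash`; the provider key is read
# from the env var named by --api-key-env (default: OPENAI_API_KEY).
stress_pack() {
  pack_yaml="$1"
  report="$2"
  iterations="${3:-20}"   # illustrative value
  agentclash security stress-run "$pack_yaml" \
    --provider openai --model gpt-4o-mini \
    --iterations "$iterations" --concurrency 4 --out "$report"
}
```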

skills

Export bundled AgentClash Agent Skills

export

Export bundled skills to a directory or tarball

Flags

  • --dir — Output directory or .tar.gz archive path (required)
  • --format default: dir — Output format: dir or tar.gz
  • --host — Layout for a coding agent host (claude, codex, cursor, openclaw, hermes, opencode)
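The host names accepted by `--host` are listed in the flag text, so a wrapper can reject typos before calling the CLI. A sketch, assuming the binary is `agentclash`; `export_skills_for_host` is a hypothetical helper:

```shell
# Hypothetical helper: export bundled skills laid out for one coding agent host.
# Assumption: the CLI is invoked as `agentclash`.
export_skills_for_host() {
  host="$1"
  out_dir="$2"
  case "$host" in
    claude|codex|cursor|openclaw|hermes|opencode) ;;   # hosts named in the flag text
    *) echo "unknown host: $host" >&2; return 2 ;;
  esac
  agentclash skills export --dir "$out_dir" --format dir --host "$host"
}
```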

version

Show CLI version information

workspace

Manage workspaces

create

Create a workspace

Flags

  • --name (required) — Workspace name (required)
  • --org — Organization ID (required)
  • --slug — Workspace slug (optional)

get <id>

Get workspace details

list

List workspaces in an organization

Flags

  • --org — Organization ID (uses default if not set)

members

Manage workspace members

invite

Invite a member to the workspace

Flags

  • --email (required) — Email address to invite (required)
  • --role default: workspace_member — Role: workspace_admin, workspace_member, workspace_viewer

list

List workspace members

update <membershipId>

Update a workspace membership

Flags

  • --role — New role
  • --status — New status

update <id>

Update a workspace

Flags

  • --name — New workspace name
  • --public-packs — Allow this workspace to use public challenge packs
  • --status — New status (active, archived)

use <id>

Set the default workspace

Source pointers

  • cli/cmd/root.go
  • cli/cmd/auth.go
  • cli/cmd/workspace.go
  • cli/cmd/run.go
  • cli/cmd/compare.go