CI/CD Agent Gates
Use a repo-tracked AgentClash CI manifest to define which agent revision, workload, baseline, and gate a pull request should run.
AgentClash CI should gate an agent revision, not only a prompt diff.
Prompt-focused tools can usually watch prompts/** and rerun a prompt eval. AgentClash's main product model is richer: an agent change can touch instructions, workflow code, tool bindings, model aliases, runtime limits, output schemas, guardrails, or retrieval configuration. The CI contract therefore needs to name the candidate agent build, deployment settings, challenge workload, baseline, and gate policy explicitly.
The manifest is the contract
Create a repo-tracked manifest:
agentclash ci init .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml --remote --json
agentclash ci baseline --manifest .agentclash/ci.yaml --json
agentclash ci should-run --changed-file prompts/system.md --json
The generated manifest has this shape:
version: 1
trigger:
  paths:
    - .agentclash/agent.json
    - prompts/**
    - tools/**
  labels:
    - agentclash/eval
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
  deployment:
    name: pr-candidate
    runtime_profile_id: 00000000-0000-0000-0000-000000000002
    provider_account_id: 00000000-0000-0000-0000-000000000003
    model_alias_id: 00000000-0000-0000-0000-000000000004
evaluation:
  challenge_pack_version_id: 00000000-0000-0000-0000-000000000005
  input_set_id: 00000000-0000-0000-0000-000000000006
  regression_suites:
    - 00000000-0000-0000-0000-000000000007
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
gate:
  fail_on: regression
regressions:
  promote_failures: proposed
The IDs in the generated file are placeholders. Replace them with workspace resources before using the manifest for a real gate.
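A small pre-flight check can catch a manifest that still carries generated placeholders. The sketch below is illustrative only: it assumes placeholders always follow the all-zero-prefix pattern shown in the generated file, and the function name and the non-placeholder UUID in the sample are made up for this example. It is not part of the AgentClash CLI.

```python
import re

# Assumption: placeholders look like the generated example IDs, i.e. a UUID
# whose first four groups are all zeros (00000000-0000-0000-0000-0000000000NN).
PLACEHOLDER_ID = re.compile(r"\b00000000-0000-0000-0000-[0-9a-f]{12}\b")


def find_placeholder_ids(manifest_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still contain a placeholder ID."""
    hits = []
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if PLACEHOLDER_ID.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical manifest fragment: one placeholder left, one ID already replaced.
sample = """\
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
baseline:
  run_id: 3f2b8c1e-9a4d-4e7b-8c21-5d6f7a8b9c0d
"""

for number, line in find_placeholder_ids(sample):
    print(f"line {number}: placeholder ID still present: {line}")
```

Working on the raw text keeps the check free of a YAML dependency; a stricter version could parse the manifest and check only the known ID fields.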
Local validation is always offline. Add --remote when you want the CLI to call the AgentClash API and verify that the manifest's agent build, runtime profile, provider account, model alias, challenge pack version, input set, regression suites or cases, and baseline are visible from the selected workspace. Because this makes real authenticated API calls, set AGENTCLASH_API_URL, AGENTCLASH_TOKEN, and AGENTCLASH_WORKSPACE in CI and expect normal API latency, rate limits, and token scoping rules. JSON output includes a remote.checks[] entry per referenced field, so CI can report whether a failure came from the local manifest contract or from an API/resource check.
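To report remote failures separately in CI, a script can read the remote.checks[] array from the JSON output. The following is a minimal sketch under stated assumptions: this guide only confirms that remote.checks[] exists with one entry per referenced field, so the "field", "ok", and "message" keys used here are hypothetical and should be adjusted to the actual output.

```python
import json

# Hypothetical payload. Only the remote.checks[] array is described in this
# guide; the "field", "ok", and "message" keys are assumptions for illustration.
sample_output = json.loads("""
{
  "remote": {
    "checks": [
      {"field": "candidate.build.agent_build_id", "ok": true},
      {"field": "baseline.run_id", "ok": false, "message": "not visible in workspace"}
    ]
  }
}
""")


def failed_remote_checks(result: dict) -> list[dict]:
    """Collect remote checks that did not pass; empty when no remote section exists."""
    checks = (result.get("remote") or {}).get("checks") or []
    return [check for check in checks if not check.get("ok", False)]


for check in failed_remote_checks(sample_output):
    detail = check.get("message", "no detail")
    print(f"remote check failed: {check['field']}: {detail}")
```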
What each section means
- trigger says which repository paths and optional labels should cause the workflow to run.
- candidate.build names the existing AgentClash build and the source-backed build-version spec to test.
- candidate.deployment names the runtime resources used for the candidate deployment.
- evaluation names the workload: challenge pack version, optional input set, and optional regression suites or cases.
- baseline names the locked reference run or deployment, plus explicit refresh and staleness rules.
- gate names the release-gate failure threshold.
- regressions controls whether failed cases should only be reported, proposed for promotion, or eventually auto-promoted on main.
The important distinction is:
agent build/deployment = thing under test
challenge pack/regression suite = workload used to test it
release gate = decision policy
If you are deciding what the workload should contain, use CI/CD Workload Recipes for coding, research, support/ops, and long-horizon agent patterns.
Baseline strategy and refresh
For pull request gates, prefer baseline.run_id. It pins the exact accepted mainline run, so every reviewer can see what changed when the baseline moves. Add baseline.run_agent_id only when the locked run has multiple participants and the gate must compare against one specific agent lane.
Use baseline.deployment_id only when the team intentionally wants a moving selector. agentclash ci baseline resolves it to the newest completed run in the workspace that matches the manifest workload and includes that deployment. The command prints the exact resolved run_id and run_agent_id so downstream automation still compares against concrete IDs.
Use baseline.max_age_days when a stale baseline should block the gate. The resolver checks the chosen run's finished_at or created_at timestamp and fails instead of silently comparing against old behavior.
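The staleness rule can be pictured as a simple age comparison. This is a sketch, not the resolver's actual code: it assumes finished_at is preferred with created_at as the fallback, and that a baseline exactly max_age_days old still passes. The function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def baseline_is_stale(
    finished_at: Optional[datetime],
    created_at: datetime,
    max_age_days: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the baseline run is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    # Assumption: finished_at wins when present, created_at is the fallback.
    reference = finished_at or created_at
    return now - reference > timedelta(days=max_age_days)


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
fresh = datetime(2025, 6, 10, tzinfo=timezone.utc)  # 20 days old
old = datetime(2025, 5, 1, tzinfo=timezone.utc)     # 60 days old

print(baseline_is_stale(fresh, fresh, max_age_days=30, now=now))  # False
print(baseline_is_stale(None, old, max_age_days=30, now=now))     # True
```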
Refreshes are explicit:
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
- manual: after a successful mainline run, update baseline.run_id in a reviewed change.
- propose: automation may propose the new baseline, but a human still reviews the manifest change.
- auto_on_main: a protected mainline workflow may update the manifest with an auditable commit after the gate passes.
Resolve the baseline before running a gate:
agentclash ci baseline \
  --manifest .agentclash/ci.yaml \
  --json
The JSON includes strategy, source, baseline.run_id, optional baseline.run_agent_id, refresh.mode, and refresh.next_action.
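Downstream automation can load those fields into a typed record before comparing runs. The sketch below assumes the dotted names map to nested JSON objects (a "baseline" object and a "refresh" object); the sample values for strategy, source, and next_action are invented for illustration, and the class and function names are not part of the CLI.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvedBaseline:
    strategy: str
    source: str
    run_id: str
    run_agent_id: Optional[str]  # optional in the CLI output
    refresh_mode: str
    next_action: str


def parse_baseline(raw: str) -> ResolvedBaseline:
    """Parse the JSON printed by the baseline command into a typed record."""
    data = json.loads(raw)
    baseline = data["baseline"]
    refresh = data["refresh"]
    return ResolvedBaseline(
        strategy=data["strategy"],
        source=data["source"],
        run_id=baseline["run_id"],
        run_agent_id=baseline.get("run_agent_id"),
        refresh_mode=refresh["mode"],
        next_action=refresh["next_action"],
    )


# Hypothetical output; the field values are placeholders, not real CLI output.
sample = """
{
  "strategy": "run_id",
  "source": "manifest",
  "baseline": {"run_id": "00000000-0000-0000-0000-000000000008"},
  "refresh": {"mode": "manual", "next_action": "none"}
}
"""

resolved = parse_baseline(sample)
print(resolved.run_id, resolved.run_agent_id, resolved.refresh_mode)
```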
Decide whether CI should run
Use agentclash ci should-run when you want AgentClash to explain whether a pull request touches the agent contract. A matching path or label produces should_run: true; unrelated docs-only changes produce should_run: false.
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file prompts/system.md
Labels can force the gate even when paths do not match:
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file docs/readme.md \
  --labels agentclash/eval \
  --json
For local or GitHub Actions diffing, pass refs explicitly:
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --base origin/main \
  --head HEAD \
  --json
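The path-or-label decision described in this section can be approximated in a few lines. This is a sketch of the idea rather than the CLI's matcher: it assumes shell-style globs as implemented by Python's fnmatch, where `*` also matches `/`, so the real glob semantics may differ in edge cases. The function name is made up for this example.

```python
from fnmatch import fnmatchcase


def should_run(changed_files, pr_labels, trigger_paths, trigger_labels) -> bool:
    """Return True when any changed file matches a trigger path or any label matches."""
    # fnmatch's "*" also matches "/", so "prompts/**" covers nested files.
    path_hit = any(
        fnmatchcase(changed, pattern)
        for changed in changed_files
        for pattern in trigger_paths
    )
    label_hit = any(label in trigger_labels for label in pr_labels)
    return path_hit or label_hit


# Trigger values copied from the generated manifest shown earlier in this guide.
paths = [".agentclash/agent.json", "prompts/**", "tools/**"]
labels = ["agentclash/eval"]

print(should_run(["prompts/system.md"], [], paths, labels))               # True
print(should_run(["docs/readme.md"], [], paths, labels))                  # False
print(should_run(["docs/readme.md"], ["agentclash/eval"], paths, labels)) # True
```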
GitHub Actions sketch
The manifest is the single source of truth for the candidate revision, workload, baseline, and gate. A pull request workflow can validate it, decide whether it should run for the changed files, then let agentclash ci run create the candidate build version, deployment, run, and release-gate evaluation.
name: AgentClash gate
on:
  pull_request:
    paths:
      - ".agentclash/**"
      - "prompts/**"
      - "tools/**"
jobs:
  agentclash:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
      - run: npm i -g agentclash
      - name: Validate AgentClash CI manifest
        run: agentclash ci validate .agentclash/ci.yaml --remote
        env:
          AGENTCLASH_API_URL: https://api.agentclash.dev
          AGENTCLASH_TOKEN: ${{ secrets.AGENTCLASH_TOKEN }}
          AGENTCLASH_WORKSPACE: ${{ secrets.AGENTCLASH_WORKSPACE }}
      - name: Decide whether AgentClash gate should run
        id: should-run
        run: |
          SHOULD_RUN=$(agentclash ci should-run --json | jq -r '.should_run')
          echo "should_run=$SHOULD_RUN" >> "$GITHUB_OUTPUT"
      - name: Run AgentClash CI gate
        if: steps.should-run.outputs.should_run == 'true'
        id: agentclash
        run: |
          set +e
          agentclash ci run --manifest .agentclash/ci.yaml --json \
            > agentclash-ci-result.json
          status=$?
          cat agentclash-ci-result.json
          echo "run_id=$(jq -r '.candidate.run_id' agentclash-ci-result.json)" >> "$GITHUB_OUTPUT"
          echo "gate_verdict=$(jq -r '.gate_verdict' agentclash-ci-result.json)" >> "$GITHUB_OUTPUT"
          exit "$status"
        env:
          AGENTCLASH_API_URL: https://api.agentclash.dev
          AGENTCLASH_TOKEN: ${{ secrets.AGENTCLASH_TOKEN }}
          AGENTCLASH_WORKSPACE: ${{ secrets.AGENTCLASH_WORKSPACE }}
agentclash ci run exits nonzero when the gate verdict should block CI, when the candidate run times out, or when the manifest/API setup is invalid.
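Until richer reporting lands, a follow-up step could turn the saved result file into a one-line job summary. The sketch below relies only on the two fields the workflow already reads with jq (candidate.run_id and gate_verdict); the sample values, including the "fail" verdict string, are assumptions, and the helper name is hypothetical. GITHUB_STEP_SUMMARY is the file path GitHub Actions provides for job summaries.

```python
import json
import os


def summarize(result: dict) -> str:
    """Build a one-line summary from the fields the workflow already extracts."""
    run_id = (result.get("candidate") or {}).get("run_id", "unknown")
    verdict = result.get("gate_verdict", "unknown")
    return f"AgentClash gate verdict: {verdict} (candidate run {run_id})"


# Hypothetical result payload; real output contains more fields.
sample = json.loads('{"candidate": {"run_id": "run-123"}, "gate_verdict": "fail"}')
line = summarize(sample)
print(line)

# Append to the job summary only when running inside GitHub Actions.
summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
if summary_path:
    with open(summary_path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

In a workflow, the same function would read agentclash-ci-result.json instead of the inline sample.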
Regression promotion policy
Do not auto-promote every PR failure by default. A bad run, flaky dependency, or weak evaluator could pollute the regression suite.
Use this conservative progression:
regressions:
  promote_failures: disabled
Report failures only.
regressions:
  promote_failures: proposed
Record promotion candidates for review.
regressions:
  promote_failures: auto_on_main
Only a future main-branch workflow should auto-promote high-confidence failures after the gate has proven useful.
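The progression above amounts to a small decision table. The sketch below is one way to express it, not AgentClash's implementation: the action names are invented, and falling back to "propose" when auto_on_main is evaluated off the main branch is an assumption chosen to stay conservative.

```python
def promotion_action(mode: str, branch: str, main_branch: str = "main") -> str:
    """Map a promote_failures mode and the current branch to an action."""
    if mode == "disabled":
        return "report"  # report failures only
    if mode == "proposed":
        return "propose"  # record promotion candidates for review
    if mode == "auto_on_main":
        # Assumption: off the main branch, fall back to proposing for review.
        return "promote" if branch == main_branch else "propose"
    raise ValueError(f"unknown promote_failures mode: {mode!r}")


for mode, branch in [
    ("disabled", "feature/x"),
    ("proposed", "feature/x"),
    ("auto_on_main", "feature/x"),
    ("auto_on_main", "main"),
]:
    print(f"{mode} on {branch}: {promotion_action(mode, branch)}")
```

Raising on an unknown mode keeps a typo in the manifest from silently enabling or disabling promotion.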
Current limits
- agentclash ci validate validates the manifest shape locally; pass --remote for API-backed resource checks.
- agentclash ci should-run only decides whether a gate should run; agentclash ci run performs the orchestration.
- agentclash ci run creates a one-off candidate deployment for the manifest build version; cleanup/retention policy is still a follow-up.
- PR metadata such as repository, pull request number, branch, and commit SHA is not yet attached to runs by the CLI.
- PR comments, check-run summaries, and uploaded JSON artifacts are still follow-up work.
- Automatic regression promotion should remain opt-in and conservative.