CI/CD Agent Gates

Use a repo-tracked AgentClash CI manifest to define which agent revision, workload, baseline, and gate a pull request should run.

AgentClash CI should gate an agent revision, not only a prompt diff.

Prompt-focused tools can usually watch prompts/** and rerun a prompt eval. AgentClash's main product model is richer: an agent change can touch instructions, workflow code, tool bindings, model aliases, runtime limits, output schemas, guardrails, or retrieval configuration. The CI contract therefore needs to name the candidate agent build, deployment settings, challenge workload, baseline, and gate policy explicitly.

The manifest is the contract

Create a repo-tracked manifest:

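```bash
# Generate a starter manifest in the repository.
agentclash ci init .agentclash/ci.yaml
# Validate the manifest shape locally (offline).
agentclash ci validate .agentclash/ci.yaml
# Also verify the referenced resources through the AgentClash API.
agentclash ci validate .agentclash/ci.yaml --remote --json
# Resolve the baseline named in the manifest.
agentclash ci baseline --manifest .agentclash/ci.yaml --json
# Check whether a changed file matches the manifest trigger.
agentclash ci should-run --changed-file prompts/system.md --json
```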

The generated manifest has this shape:

```yaml
version: 1
trigger:
  paths:
    - .agentclash/agent.json
    - prompts/**
    - tools/**
  labels:
    - agentclash/eval
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
  deployment:
    name: pr-candidate
    runtime_profile_id: 00000000-0000-0000-0000-000000000002
    provider_account_id: 00000000-0000-0000-0000-000000000003
    model_alias_id: 00000000-0000-0000-0000-000000000004
evaluation:
  challenge_pack_version_id: 00000000-0000-0000-0000-000000000005
  input_set_id: 00000000-0000-0000-0000-000000000006
  # For deterministic voice eval packs, set: mode: text-sim
  regression_suites:
    - 00000000-0000-0000-0000-000000000007
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
gate:
  fail_on: regression
regressions:
  promote_failures: proposed
```

The IDs in the generated file are placeholders. Replace them with workspace resources before using the manifest for a real gate.
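
A leftover placeholder is easy to miss in review. The sketch below is one possible pre-flight check, not part of the AgentClash CLI: it scans the manifest text for IDs shaped like the generated sample (an all-zero UUID with a small trailing counter). The function name `find_placeholder_ids` and the exact pattern are assumptions made for illustration.

```python
import re

# Assumption: generated placeholders all look like the sample manifest,
# i.e. zero-filled UUID groups followed by a short numeric counter.
PLACEHOLDER_ID = re.compile(r"\b0{8}-0{4}-0{4}-0{4}-0{8}\d{4}\b")


def find_placeholder_ids(manifest_text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder_id) pairs found in the manifest text."""
    hits = []
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        # Drop YAML comments so commented-out examples do not trip the check.
        content = line.split("#", 1)[0]
        for match in PLACEHOLDER_ID.finditer(content):
            hits.append((number, match.group(0)))
    return hits
```

A CI step could read `.agentclash/ci.yaml`, call this helper, and fail early when the returned list is not empty.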

Local validation is always offline. Add --remote when you want the CLI to call the AgentClash API and verify that the manifest's agent build, runtime profile, provider account, model alias, challenge pack version, input set, regression suites or cases, and baseline are visible from the selected workspace. Because this makes real authenticated API calls, set AGENTCLASH_API_URL, AGENTCLASH_TOKEN, and AGENTCLASH_WORKSPACE in CI and expect normal API latency, rate limits, and token scoping rules. JSON output includes a remote.checks[] entry per referenced field, so CI can report whether a failure came from the local manifest contract or from an API/resource check.
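
Because remote validation depends on three environment variables, a wrapper script may want to fail with a clear message before the CLI is invoked. The following is a minimal sketch under that assumption; the variable names come from the paragraph above, while the helper `missing_remote_env` is hypothetical and not part of AgentClash.

```python
import os
from typing import Mapping

# Variable names taken from the documentation text above.
REQUIRED_REMOTE_ENV = (
    "AGENTCLASH_API_URL",
    "AGENTCLASH_TOKEN",
    "AGENTCLASH_WORKSPACE",
)


def missing_remote_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank, in a stable order."""
    return [name for name in REQUIRED_REMOTE_ENV if not env.get(name, "").strip()]
```

Calling `missing_remote_env()` with no arguments inspects the current process environment; passing a dictionary makes the check easy to unit test.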

What each section means

trigger says which repository paths and optional labels should cause the workflow to run.
candidate.build names the existing AgentClash build and the source-backed build-version spec to test.
candidate.deployment names the runtime resources used for the candidate deployment.
evaluation names the workload: challenge pack version, optional input set, and optional regression suites or cases. For voice packs, set the optional evaluation.mode: text-sim to request a deterministic text-simulated voice eval (text-sim is the only mode supported today; audio-sim, live-call, and replay-import are reserved for future use).
baseline names the locked reference run or deployment, plus explicit refresh and staleness rules.
gate names the release-gate failure threshold.
regressions controls whether failed cases should only be reported, proposed for promotion, or eventually auto-promoted on main.

The important distinction is:

```text
agent build/deployment = thing under test
challenge pack/regression suite = workload used to test it
release gate = decision policy
```

If you are deciding what the workload should contain, use CI/CD Workload Recipes for coding, research, support/ops, and long-horizon agent patterns.

Baseline strategy and refresh

For pull request gates, prefer baseline.run_id. It pins the exact accepted mainline run, so every reviewer can see what changed when the baseline moves. Add baseline.run_agent_id only when the locked run has multiple participants and the gate must compare against one specific agent lane.

Use baseline.deployment_id only when the team intentionally wants a moving selector. agentclash ci baseline resolves it to the newest completed run in the workspace that matches the manifest workload and includes that deployment. The command prints the exact resolved run_id and run_agent_id so downstream automation still compares against concrete IDs.
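
The moving selector can be pictured as a filter followed by a "newest" pick. The sketch below is an illustration only, not the CLI's implementation: the run dictionaries and their keys (`id`, `status`, `challenge_pack_version_id`, `deployment_ids`, `finished_at`) are hypothetical, and "matches the manifest workload" is reduced here to matching the challenge pack version.

```python
from datetime import datetime
from typing import Optional


def newest_matching_run(
    runs: list[dict],
    deployment_id: str,
    challenge_pack_version_id: str,
) -> Optional[dict]:
    """Pick the newest completed run for the workload that includes the deployment."""
    candidates = [
        run
        for run in runs
        if run["status"] == "completed"
        and run["challenge_pack_version_id"] == challenge_pack_version_id
        and deployment_id in run["deployment_ids"]
    ]
    # Hypothetical field: finished_at is assumed to be an ISO 8601 string.
    return max(
        candidates,
        key=lambda run: datetime.fromisoformat(run["finished_at"]),
        default=None,
    )
```

Returning `None` when nothing matches leaves it to the caller to fail the gate rather than fall back to an arbitrary run.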

Use baseline.max_age_days when a stale baseline should block the gate. The resolver checks the chosen run's finished_at or created_at timestamp and fails instead of silently comparing against old behavior.
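
The staleness rule amounts to a single date comparison. Here is a minimal sketch of that check, assuming the completion time is preferred over the creation time when both exist and that a baseline exactly `max_age_days` old still passes; the text does not confirm either detail, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def baseline_is_stale(
    finished_at: Optional[datetime],
    created_at: datetime,
    max_age_days: int,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the baseline run is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    # Assumption: use finished_at when present, otherwise fall back to created_at.
    reference = finished_at or created_at
    return now - reference > timedelta(days=max_age_days)
```

All timestamps are expected to be timezone-aware so the subtraction is well defined.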

Refreshes are explicit:

```yaml
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
```

manual: after a successful mainline run, update baseline.run_id in a reviewed change.
propose: automation may propose the new baseline, but a human still reviews the manifest change.
auto_on_main: a protected mainline workflow may update the manifest with an auditable commit after the gate passes.

Resolve the baseline before running a gate:

```bash
agentclash ci baseline \
  --manifest .agentclash/ci.yaml \
  --json
```

The JSON includes strategy, source, baseline.run_id, optional baseline.run_agent_id, refresh.mode, and refresh.next_action.

Decide whether CI should run

Use agentclash ci should-run when you want AgentClash to explain whether a pull request touches the agent contract. A matching path or label produces should_run: true; unrelated docs-only changes produce should_run: false.

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file prompts/system.md
```
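
To make the decision rule concrete, the sketch below approximates it in Python: the gate should run when any changed file matches a trigger path or any pull request label appears in the trigger labels. It is an illustration rather than the CLI's matcher. In particular, it uses `fnmatch`, where `*` also matches `/`, and the real glob semantics may differ.

```python
from fnmatch import fnmatchcase


def should_run(
    changed_files: list[str],
    pr_labels: list[str],
    trigger_paths: list[str],
    trigger_labels: list[str],
) -> bool:
    """Return True when a changed path or a PR label matches the manifest trigger."""
    path_hit = any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in trigger_paths
    )
    label_hit = any(label in trigger_labels for label in pr_labels)
    return path_hit or label_hit
```

With the trigger from the sample manifest, `prompts/system.md` matches `prompts/**`, while `docs/readme.md` matches nothing unless the `agentclash/eval` label is present.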

Labels can force the gate even when paths do not match:

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file docs/readme.md \
  --labels agentclash/eval \
  --json
```

In GitHub Actions, ci should-run reads pull request labels from GITHUB_EVENT_PATH automatically when --labels is omitted, and the bundled action passes that behavior through. Use --github-event <path> only when testing a saved event payload locally.
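
For a saved payload, it can help to see which labels the event actually carries before passing it to the CLI. The sketch below reads label names from a GitHub `pull_request` event file, where labels are listed under `pull_request.labels[].name`. The helper name `labels_from_event` is hypothetical, and how the CLI itself parses the payload is not shown in this guide.

```python
import json


def labels_from_event(event_path: str) -> list[str]:
    """Return the PR label names found in a GitHub event payload file."""
    with open(event_path, encoding="utf-8") as handle:
        event = json.load(handle)
    # Non-PR events have no pull_request object, so treat that as "no labels".
    pull_request = event.get("pull_request") or {}
    return [label["name"] for label in pull_request.get("labels", [])]
```

Inside a workflow the path would typically come from the `GITHUB_EVENT_PATH` environment variable.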

For local or GitHub Actions diffing, pass refs explicitly:

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --base origin/main \
  --head HEAD \
  --json
```

GitHub Actions sketch

The manifest is the single source of truth for the candidate revision, workload, baseline, and gate. A pull request workflow can validate it, decide whether it should run for the changed files, then let agentclash ci run create the candidate build version, deployment, run, and release-gate evaluation.

Use the reusable AgentClash action when you want the standard GitHub integration without rewriting the shell glue:

```yaml
name: AgentClash gate

on:
  pull_request:
    paths:
      - ".agentclash/**"
      - "prompts/**"
      - "tools/**"

jobs:
  agentclash:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-node@v4
        with:
          node-version: "22"

      - name: Run AgentClash CI gate
        id: agentclash
        uses: agentclash/agentclash/.github/actions/agentclash-ci@main
        with:
          token: ${{ secrets.AGENTCLASH_TOKEN }}
          workspace: ${{ secrets.AGENTCLASH_WORKSPACE }}
          manifest: .agentclash/ci.yaml

      - name: Upload AgentClash gate artifacts
        if: always() && steps.agentclash.outputs['should-run'] == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: agentclash-ci
          path: |
            ${{ steps.agentclash.outputs.result-file }}
            ${{ steps.agentclash.outputs.artifact-dir }}/*.json
```

By default the action installs the published agentclash npm package, runs ci validate --remote, and then runs ci should-run, auto-detecting pull request labels from the GitHub event payload. It skips unrelated changes and runs ci run when the trigger matches. When pull request context is available, it posts or updates a sticky structured PR comment. It exposes the outputs should-run, skip-reason, run-id, gate-verdict, exit-code, result-file, and artifact-dir, and it preserves the CLI exit code, so a blocking gate fails the workflow normally. Grant pull-requests: write when you want GitHub-hosted PR comments; commenting is best-effort and permission failures do not override the AgentClash result. When run metadata is available, the comment links reviewers directly to the AgentClash candidate run, baseline run, comparison, failures, scorecard, replay, and regression cases. If setup fails before a candidate run is created, the sticky comment reports the errored setup state and points reviewers at the GitHub Actions log.

agentclash ci run exits nonzero when the gate verdict should block CI, when the candidate run times out, or when the manifest/API setup is invalid. In GitHub Actions, it automatically attaches repository, pull request, branch, default branch, commit, workflow, event, and workflow-run URL metadata to the AgentClash run. It also appends a reviewer-friendly Markdown section when the $GITHUB_STEP_SUMMARY environment variable is set, while the bundled action turns the same run evidence into the PR comment. Pass --summary-file <path> for another Markdown destination, or --github-step-summary=false to disable the automatic GitHub summary.

--artifact-dir writes stable JSON files intended for actions/upload-artifact: result.json for the final CLI envelope, run.json for run creation/completion payloads, scorecard.json for candidate scorecard evidence, comparison.json for baseline/candidate comparison evidence, and gate.json for the release-gate verdict and policy metadata. The summary and artifacts include the challenge pack version, baseline, candidate, policy, verdict, top evidence lines, regression candidate promotion outcomes, and AgentClash links when the API returns them. Use --ci-repository, --ci-pull-request, --ci-branch, --ci-default-branch, --ci-commit, and the other --ci-* flags when running from another CI system or a custom wrapper.
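
A custom wrapper may want to confirm which artifact files were produced before uploading or parsing them. The sketch below checks a directory for the five file names listed above; the helper `missing_artifacts` is hypothetical. Whether every file is written on every run (for example after a setup failure) is not stated here, so treat a missing file as a signal to inspect, not necessarily an error.

```python
from pathlib import Path

# File names taken from the --artifact-dir description above.
EXPECTED_ARTIFACTS = (
    "result.json",
    "run.json",
    "scorecard.json",
    "comparison.json",
    "gate.json",
)


def missing_artifacts(artifact_dir: str) -> list[str]:
    """Return the expected artifact file names that are absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```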

Regression promotion policy

Do not auto-promote every PR failure by default. A bad run, flaky dependency, or weak evaluator could pollute the regression suite. Use this conservative progression:

```yaml
regressions:
  promote_failures: disabled
```

Report failures only. When the gate fails, the CLI records that promotion was skipped and does not call failure-listing or promotion endpoints.

```yaml
regressions:
  promote_failures: proposed
```

Create reviewable candidates after a failing gate. The CLI lists the candidate run's failure-review items, checks each target suite from evaluation.regression_suites, skips any challenge identity that already has a non-archived/non-rejected case, then calls the promote-failure API with status: proposed.
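
The duplicate check described above can be sketched as a set lookup. This is an illustration under stated assumptions, not the CLI's code: the dictionary keys `challenge_identity` and `status` are hypothetical, and the status names `archived` and `rejected` are inferred from the wording of this section.

```python
def failures_to_propose(failures: list[dict], existing_cases: list[dict]) -> list[dict]:
    """Keep only failures whose challenge identity has no live case in the suite."""
    # A case blocks a new proposal unless it was archived or rejected.
    blocked = {
        case["challenge_identity"]
        for case in existing_cases
        if case["status"] not in {"archived", "rejected"}
    }
    return [
        failure
        for failure in failures
        if failure["challenge_identity"] not in blocked
    ]
```

Each remaining failure would then be sent to the promote-failure API with status: proposed, once per target suite.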

Proposed cases appear in the regression suite UI without entering future runs. A reviewer can accept them by changing status to active, or reject/archive them if the failure is noisy, duplicated, or not worth keeping.

```yaml
regressions:
  promote_failures: auto_on_main
```

Create active cases only from protected default-branch runs. The CLI refuses pull request events, refs/pull/*, missing default branch metadata, and non-default branches. GitHub Actions usually supplies the default branch through the event payload; custom CI wrappers should pass --ci-default-branch main.
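
The refusal rules read naturally as a short guard. The sketch below mirrors the four conditions listed above; the function name and parameters are hypothetical, and treating any event name that starts with `pull_request` as a pull request event is an assumption.

```python
from typing import Optional


def auto_promotion_allowed(
    event_name: str,
    ref: str,
    branch: str,
    default_branch: Optional[str],
) -> bool:
    """Return True only for non-PR runs on the known default branch."""
    if event_name.startswith("pull_request"):
        return False  # pull request events are refused
    if ref.startswith("refs/pull/"):
        return False  # PR merge/head refs are refused
    if not default_branch:
        return False  # missing default branch metadata is refused
    return branch == default_branch  # non-default branches are refused
```

A custom CI wrapper that passes `--ci-default-branch main` would correspond to calling this guard with `default_branch="main"`.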

All modes preserve the original gate exit code. Promotion errors are reported in the human output, JSON regression_promotions.errors, GitHub step summary, and artifact result.json, but a blocking regression still exits with the gate failure code.

Current limits

agentclash ci validate validates the manifest shape locally; pass --remote for API-backed resource checks.
agentclash ci should-run only decides whether a gate should run; agentclash ci run performs the orchestration.
agentclash ci run creates a one-off candidate deployment for the manifest build version; cleanup/retention policy is still a follow-up.
GitHub Check Runs with rich annotations are still follow-up work; use the sticky PR comment, GitHub step summary, and uploaded JSON artifacts today.
Regression candidate promotion requires evaluation.regression_suites; without at least one target suite, ci run reports promotion as blocked.
