Guides

Dataset CI Gates

Record dataset eval baselines, sync examples into regression suites, and fail CI when a candidate run regresses.

Dataset CI gates close the loop between pinned dataset evals and release safety. After you run a dataset eval, record a baseline from the green run, optionally sync the pinned version into a regression suite, and gate every subsequent change with agentclash dataset test.

Record a baseline

Baselines capture per-example outcomes from a completed dataset eval run. Record one through the API (or the web app once baseline creation UI ships):

bash
1export AGENTCLASH_API_URL="https://api.agentclash.dev"
2export AGENTCLASH_TOKEN="<token>"
3export AGENTCLASH_WORKSPACE="<workspace-id>"
4
5curl -sS -X POST "$AGENTCLASH_API_URL/v1/workspaces/$AGENTCLASH_WORKSPACE/datasets/<datasetId>/baselines" \
6  -H "Authorization: Bearer $AGENTCLASH_TOKEN" \
7  -H "Content-Type: application/json" \
8  -d '{"run_id":"<completed-eval-run-id>","label":"main baseline"}'

List baselines from the API or the dataset detail Regression / CI tab in the web app.

Sync into a regression suite

Promote the pinned dataset version into workspace_regression_cases while preserving provenance (dataset_example_id, trace metadata, curation links):

bash
1agentclash dataset sync-regression-suite <datasetId> \
2  --version <dataset-version-id> \
3  --pack <challenge-pack-version-id> \
4  --challenge support \
5  --suite-name "Support dataset regression"

Re-running sync is idempotent: existing cases keyed by metadata.dataset_example_id are skipped; new examples are appended. The dataset keeps a single link row pointing at the suite and last synced version.

Wire the linked suite into .agentclash/ci.yaml under evaluation.regression_suites when you also want manifest-based release gates.

Gate a candidate run

Compare a candidate eval run against a baseline:

bash
1# Use an existing candidate run
2agentclash dataset test <datasetId> \
3  --baseline <baseline-id> \
4  --run <candidate-run-id> \
5  --max-regressions 0
6
7# Or start a fresh eval and wait for completion
8agentclash dataset test <datasetId> \
9  --baseline <baseline-id> \
10  --eval \
11  --version <dataset-version-id> \
12  --pack <challenge-pack-version-id> \
13  --challenge support \
14  --deployment <deployment-id>

Output formats:

  • --format text (default) prints a human summary and exits 1 on regression.
  • --format json prints the full gate payload (including on HTTP 422) and exits 1 on failure.
  • --format junit emits JUnit XML for CI parsers and exits 1 on failure.

Threshold flags:

  • --min-pass-rate 0.95 fails when candidate pass rate drops below 95%.
  • --max-regressions 0 fails on any newly failing or missing baseline example.

GitHub Actions recipe

Minimal workflow step that runs a dataset gate on pull requests:

yaml
1name: dataset-gate
2
3on:
4  pull_request:
5    paths:
6      - prompts/**
7      - tools/**
8      - .agentclash/**
9
10jobs:
11  gate:
12    runs-on: ubuntu-latest
13    steps:
14      - uses: actions/checkout@v4
15
16      - name: Install AgentClash CLI
17        run: npm install -g agentclash
18
19      - name: Run dataset gate
20        env:
21          AGENTCLASH_API_URL: https://api.agentclash.dev
22          AGENTCLASH_TOKEN: ${{ secrets.AGENTCLASH_TOKEN }}
23          AGENTCLASH_WORKSPACE: ${{ vars.AGENTCLASH_WORKSPACE }}
24          DATASET_ID: ${{ vars.AGENTCLASH_DATASET_ID }}
25          BASELINE_ID: ${{ vars.AGENTCLASH_DATASET_BASELINE_ID }}
26          PACK_VERSION_ID: ${{ vars.AGENTCLASH_PACK_VERSION_ID }}
27          DEPLOYMENT_ID: ${{ vars.AGENTCLASH_DEPLOYMENT_ID }}
28        run: |
29          agentclash dataset test "$DATASET_ID" \
30            --baseline "$BASELINE_ID" \
31            --eval \
32            --version "${{ vars.AGENTCLASH_DATASET_VERSION_ID }}" \
33            --pack "$PACK_VERSION_ID" \
34            --challenge support \
35            --deployment "$DEPLOYMENT_ID" \
36            --max-regressions 0 \
37            --format junit > dataset-gate.xml
38
39      - name: Publish JUnit report
40        if: always()
41        uses: mikepenz/action-junit-report@v5
42        with:
43          report_paths: dataset-gate.xml

For full agent release gates (build spec, deployment, regression suites, and release-gate policy), combine this with the manifest flow in CI/CD Agent Gates.

API surface

MethodPathPurpose
POST/v1/workspaces/{workspaceID}/datasets/{datasetID}/baselinesRecord baseline from completed eval run
GET/v1/workspaces/{workspaceID}/datasets/{datasetID}/baselinesList baselines
POST/v1/workspaces/{workspaceID}/datasets/{datasetID}/gateCompare candidate run vs baseline (422 on regression)
GET/v1/workspaces/{workspaceID}/datasets/{datasetID}/regression-suiteFetch dataset ⇄ suite link
POST/v1/workspaces/{workspaceID}/datasets/{datasetID}/regression-suite/syncPromote version examples into regression cases

See the OpenAPI spec for request and response schemas.

See also