Dataset CI Gates
Record dataset eval baselines, sync examples into regression suites, and fail CI when a candidate run regresses.
Dataset CI gates close the loop between pinned dataset evals and release safety. After you run a dataset eval, record a baseline from the green run, optionally sync the pinned version into a regression suite, and gate every subsequent change with agentclash dataset test.
Record a baseline
Baselines capture per-example outcomes from a completed dataset eval run. Record one through the API (or through the web app, once the baseline creation UI ships):

```bash
# Connection settings shared by every request
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"

# Record a baseline from a completed (green) eval run
curl -sS -X POST "$AGENTCLASH_API_URL/v1/workspaces/$AGENTCLASH_WORKSPACE/datasets/<datasetId>/baselines" \
  -H "Authorization: Bearer $AGENTCLASH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"run_id":"<completed-eval-run-id>","label":"main baseline"}'
```

List baselines from the API or from the Regression / CI tab on the dataset detail page in the web app.
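Listing uses the same collection path with GET. The following is a minimal sketch, assuming the AGENTCLASH_API_URL, AGENTCLASH_TOKEN, and AGENTCLASH_WORKSPACE variables are exported as in the recording example. The helper names baselines_url and list_baselines are hypothetical, and the response body is printed unparsed because its schema is defined in the OpenAPI spec rather than on this page.

```shell
# Build the baselines collection URL (shared by POST to record and GET to list).
# $1 = API base URL, $2 = workspace ID, $3 = dataset ID
baselines_url() {
  # ${1%/} drops a single trailing slash so the path is not doubled
  printf '%s/v1/workspaces/%s/datasets/%s/baselines' "${1%/}" "$2" "$3"
}

# List baselines for a dataset and print the raw response body.
# $1 = dataset ID
list_baselines() {
  curl -sS \
    -H "Authorization: Bearer $AGENTCLASH_TOKEN" \
    "$(baselines_url "$AGENTCLASH_API_URL" "$AGENTCLASH_WORKSPACE" "$1")"
}
```

Call it as `list_baselines <datasetId>` after exporting the three variables.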
Sync into a regression suite
Promote the pinned dataset version into workspace_regression_cases while preserving provenance (dataset_example_id, trace metadata, curation links):
```bash
agentclash dataset sync-regression-suite <datasetId> \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --suite-name "Support dataset regression"
```

Re-running sync is idempotent: existing cases keyed by metadata.dataset_example_id are skipped; new examples are appended. The dataset keeps a single link row pointing at the suite and the last synced version.
Wire the linked suite into .agentclash/ci.yaml under evaluation.regression_suites when you also want manifest-based release gates.
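The skip-or-append rule behind idempotent sync can be modeled locally. This is a toy sketch of the behavior described in this section, not the server's implementation: the function name sync_examples and the one-ID-per-line file layout are assumptions made purely for illustration.

```shell
# Toy model of sync: skip example IDs already in the suite, append the rest.
# $1 = file of dataset_example_id values already in the suite (one per line)
# $2 = file of example IDs in the pinned dataset version (one per line)
sync_examples() {
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # ignore blank lines
    if grep -Fxq -- "$id" "$1"; then    # exact whole-line match on the ID
      echo "skipped $id"
    else
      echo "$id" >> "$1"
      echo "appended $id"
    fi
  done < "$2"
}
```

Running it twice against the same inputs appends nothing on the second pass, which is the property that makes re-syncing safe.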
Gate a candidate run
Compare a candidate eval run against a baseline:
```bash
# Use an existing candidate run
agentclash dataset test <datasetId> \
  --baseline <baseline-id> \
  --run <candidate-run-id> \
  --max-regressions 0

# Or start a fresh eval and wait for completion
agentclash dataset test <datasetId> \
  --baseline <baseline-id> \
  --eval \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --deployment <deployment-id>
```

Output formats:
- --format text (default) prints a human-readable summary and exits 1 on regression.
- --format json prints the full gate payload (including on HTTP 422) and exits 1 on failure.
- --format junit emits JUnit XML for CI parsers and exits 1 on failure.
Threshold flags:
- --min-pass-rate 0.95 fails when the candidate pass rate drops below 95%.
- --max-regressions 0 fails on any newly failing or missing baseline example.
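To make the two thresholds concrete, here is a small local model of the decision. It is a sketch under stated assumptions, not the CLI's actual logic: gate_check is a hypothetical name, each input file is assumed to hold lines of the form `<example-id> pass` or `<example-id> fail`, and a baseline example missing from the candidate is counted as a regression regardless of its baseline outcome.

```shell
# Toy gate: compare per-example outcomes and apply both thresholds.
# $1 = baseline file, $2 = candidate file, $3 = min pass rate, $4 = max regressions
gate_check() {
  awk -v min_rate="$3" -v max_reg="$4" '
    FILENAME == ARGV[1] { base[$1] = $2; next }   # baseline outcomes
    { cand[$1] = $2; total++; if ($2 == "pass") passed++ }
    END {
      regressions = 0
      for (id in base) {
        if (!(id in cand)) regressions++                                   # missing in candidate
        else if (base[id] == "pass" && cand[id] != "pass") regressions++   # newly failing
      }
      rate = (total > 0) ? passed / total : 0
      printf "pass_rate=%.2f regressions=%d\n", rate, regressions
      # Non-zero exit when either threshold is violated
      exit ((rate < min_rate || regressions > max_reg) ? 1 : 0)
    }' "$1" "$2"
}
```

For example, `gate_check baseline.txt candidate.txt 0.95 0` mirrors a run with both --min-pass-rate 0.95 and --max-regressions 0: an example that moves from fail to pass does not count against the candidate, while one that moves from pass to fail does.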
GitHub Actions recipe
A minimal workflow that runs a dataset gate on pull requests:

```yaml
name: dataset-gate

on:
  pull_request:
    paths:
      - prompts/**
      - tools/**
      - .agentclash/**

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install AgentClash CLI
        run: npm install -g agentclash

      - name: Run dataset gate
        env:
          AGENTCLASH_API_URL: https://api.agentclash.dev
          AGENTCLASH_TOKEN: ${{ secrets.AGENTCLASH_TOKEN }}
          AGENTCLASH_WORKSPACE: ${{ vars.AGENTCLASH_WORKSPACE }}
          DATASET_ID: ${{ vars.AGENTCLASH_DATASET_ID }}
          DATASET_VERSION_ID: ${{ vars.AGENTCLASH_DATASET_VERSION_ID }}
          BASELINE_ID: ${{ vars.AGENTCLASH_DATASET_BASELINE_ID }}
          PACK_VERSION_ID: ${{ vars.AGENTCLASH_PACK_VERSION_ID }}
          DEPLOYMENT_ID: ${{ vars.AGENTCLASH_DEPLOYMENT_ID }}
        run: |
          agentclash dataset test "$DATASET_ID" \
            --baseline "$BASELINE_ID" \
            --eval \
            --version "$DATASET_VERSION_ID" \
            --pack "$PACK_VERSION_ID" \
            --challenge support \
            --deployment "$DEPLOYMENT_ID" \
            --max-regressions 0 \
            --format junit > dataset-gate.xml

      # Publish the report even when the gate step fails
      - name: Publish JUnit report
        if: always()
        uses: mikepenz/action-junit-report@v5
        with:
          report_paths: dataset-gate.xml
```

For full agent release gates (build spec, deployment, regression suites, and release-gate policy), combine this with the manifest flow in CI/CD Agent Gates.
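Because the gate step depends on several repository variables, an unset one can surface as a confusing CLI error. One option is a preflight check at the top of the run script. This is a sketch only: require_env is a hypothetical helper, not part of the AgentClash CLI, and the variable names passed to it are the ones from the workflow on this page.

```shell
# Fail fast when any named environment variable is unset or empty.
# Usage: require_env NAME [NAME...]
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In the workflow, a line such as `require_env AGENTCLASH_TOKEN DATASET_ID BASELINE_ID PACK_VERSION_ID DEPLOYMENT_ID || exit 1` before the agentclash command would report every missing name at once.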
API surface
| Method | Path | Purpose |
|---|---|---|
| POST | /v1/workspaces/{workspaceID}/datasets/{datasetID}/baselines | Record baseline from completed eval run |
| GET | /v1/workspaces/{workspaceID}/datasets/{datasetID}/baselines | List baselines |
| POST | /v1/workspaces/{workspaceID}/datasets/{datasetID}/gate | Compare candidate run vs baseline (422 on regression) |
| GET | /v1/workspaces/{workspaceID}/datasets/{datasetID}/regression-suite | Fetch dataset ⇄ suite link |
| POST | /v1/workspaces/{workspaceID}/datasets/{datasetID}/regression-suite/sync | Promote version examples into regression cases |
See the OpenAPI spec for request and response schemas.
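When calling the gate endpoint directly instead of through the CLI, the HTTP status carries the verdict. The sketch below shows one way to turn it into a CI exit code. The helper names gate_exit_code and post_gate and the output file gate-response.json are hypothetical, treating any 2xx status as a pass is an assumption, and the request body is read from a file you supply because its fields are defined in the OpenAPI spec, not here.

```shell
# Map the gate endpoint's HTTP status to a CI-style exit code.
gate_exit_code() {
  case "$1" in
    2??) echo 0 ;;   # assumed: any 2xx means the gate passed
    422) echo 1 ;;   # regression detected (documented behavior)
    *)   echo 2 ;;   # anything else: auth, network, or server error
  esac
}

# POST a gate request and return the mapped exit code.
# $1 = dataset ID, $2 = path to a JSON request body (see the OpenAPI spec)
post_gate() {
  status=$(curl -sS -o gate-response.json -w '%{http_code}' -X POST \
    "$AGENTCLASH_API_URL/v1/workspaces/$AGENTCLASH_WORKSPACE/datasets/$1/gate" \
    -H "Authorization: Bearer $AGENTCLASH_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary "@$2")
  return "$(gate_exit_code "$status")"
}
```

Separating exit code 2 from 1 lets a pipeline distinguish a genuine regression from an infrastructure failure; the response body saved in gate-response.json holds the gate payload in either case.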