Guides

Datasets overview

Generate pinned dataset evals, record baselines, sync regression suites, and gate releases with agentclash dataset commands.

Dataset evals run a pinned version of labeled examples through a challenge pack and deployment, then compare candidate runs against recorded baselines. They complement ad-hoc eval runs when you need reproducible regression signal tied to curated examples.

Lifecycle

Create or import a dataset in the workspace (web UI or API).
Pin a version with examples linked to challenge cases.
Run a dataset eval — AgentClash executes each example and aggregates pass/fail.
Record a baseline from a green eval run.
Gate CI with agentclash dataset test (see Dataset CI Gates).

bash

1export AGENTCLASH_API_URL="https://api.agentclash.dev"
2export AGENTCLASH_TOKEN="<token>"
3export AGENTCLASH_WORKSPACE="<workspace-id>"
4
5# Start an eval against a pinned version
6agentclash dataset eval <datasetId> \
7  --version <dataset-version-id> \
8  --pack <challenge-pack-version-id> \
9  --challenge support \
10  --deployment <deployment-id>

Record a baseline

A baseline snapshots the per-example outcomes of one completed, green eval run so later candidates can be compared against it. Recording a baseline is an API call (no CLI subcommand yet): POST /v1/workspaces/{workspaceID}/datasets/{datasetID}/baselines with the eval run_id and an optional label. The response returns a baseline id — pass that to agentclash dataset test --baseline <id> to gate candidates against it.

bash

1curl -sS -X POST \
2  "$AGENTCLASH_API_URL/v1/workspaces/$AGENTCLASH_WORKSPACE/datasets/<datasetId>/baselines" \
3  -H "Authorization: Bearer $AGENTCLASH_TOKEN" \
4  -H "Content-Type: application/json" \
5  -d '{"run_id": "<eval-run-id>", "label": "support v1 baseline"}'

Baselines and regression suites

Syncing a pinned dataset version into a regression suite copies each example's provenance — its dataset-example ID and trace metadata — into the workspace regression suite, so manifest-based release gates can reference the same curated examples. (Under the hood these land in the workspace_regression_cases table keyed by dataset_example_id.)

bash

1agentclash dataset sync-regression-suite <datasetId> \
2  --version <dataset-version-id> \
3  --pack <challenge-pack-version-id> \
4  --challenge support \
5  --suite-name "Support dataset regression"

Re-running sync is idempotent: cases that already exist for an example are skipped, so you can safely re-sync after adding new examples.

CI gate output formats

agentclash dataset test supports human text, JSON gate payloads, and JUnit XML for CI parsers. Use --max-regressions 0 to fail on any newly failing example versus the baseline.

Info

Dataset gates are complementary to CI/CD agent gates and eval workflows. Use datasets when examples are curated and pinned; use eval scorecards when you are comparing agent revisions on a live pack.

Lifecycle

Record a baseline

Baselines and regression suites

CI gate output formats

See also