Datasets overview

Run pinned dataset evals, record baselines, sync regression suites, and gate releases with the agentclash dataset commands.

Dataset evals run a pinned version of labeled examples through a challenge pack and deployment, then compare candidate runs against recorded baselines. They complement ad-hoc eval runs when you need reproducible regression signal tied to curated examples.

Lifecycle

  1. Create or import a dataset in the workspace (web UI or API).
  2. Pin a version with examples linked to challenge cases.
  3. Run a dataset eval: AgentClash executes each example and aggregates the pass/fail results.
  4. Record a baseline from a green eval run.
  5. Gate CI with agentclash dataset test (see Dataset CI Gates).
```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"

# Start an eval against a pinned version
agentclash dataset eval <datasetId> \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --deployment <deployment-id>
```
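
In a CI job it can help to fail fast when the environment is incomplete. The sketch below checks the three variables exported above before any CLI call is made. The helper name `missing_env` is hypothetical, and treating all three variables as mandatory is an assumption rather than documented AgentClash behavior.

```bash
#!/usr/bin/env bash
# Hypothetical preflight: report which of the variables exported above are
# unset or empty before a CI job invokes the agentclash CLI.
set -euo pipefail

missing_env() {
  # Print the name of each listed variable that is unset or empty.
  local name
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "$name"
    fi
  done
}

missing="$(missing_env AGENTCLASH_API_URL AGENTCLASH_TOKEN AGENTCLASH_WORKSPACE)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names are joined onto one line.
  echo "missing variables:" $missing >&2
fi
```

A real pipeline step would typically exit non-zero when `missing` is non-empty; the sketch only reports so that it can be sourced safely.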

Baselines and regression suites

Baselines snapshot per-example outcomes from a completed eval. Syncing a pinned version into a regression suite copies provenance (dataset_example_id, trace metadata) into workspace_regression_cases so manifest-based release gates can reference the same examples.

```bash
agentclash dataset sync-regression-suite <datasetId> \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --suite-name "Support dataset regression"
```

Re-running sync is idempotent: existing cases keyed by dataset_example_id are skipped.
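
The skip-on-existing-key behavior can be pictured with a small sketch. It is not the AgentClash implementation: the flat file standing in for workspace_regression_cases, the `sync_cases` function, and the example IDs are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent sync keyed by dataset_example_id.
set -euo pipefail

# Temporary file standing in for the regression suite's case table (assumed).
suite_file="$(mktemp)"

sync_cases() {
  # Append each example ID given as an argument unless it is already present,
  # then print how many new cases were added.
  local id added=0
  for id in "$@"; do
    if grep -Fxq -- "$id" "$suite_file"; then
      continue  # existing case: skipped
    fi
    printf '%s\n' "$id" >> "$suite_file"
    added=$((added + 1))
  done
  echo "$added"
}

first_run="$(sync_cases ex-1 ex-2 ex-3)"
second_run="$(sync_cases ex-1 ex-2 ex-3)"
total_cases="$(wc -l < "$suite_file" | tr -d ' ')"
rm -f "$suite_file"

echo "first run added $first_run, second run added $second_run, total $total_cases"
```

Running the same sync twice adds cases only the first time, so the suite ends up with one case per example ID regardless of how often the sync is repeated.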

CI gate output formats

agentclash dataset test supports three output formats: human-readable text, JSON gate payloads, and JUnit XML for CI parsers. Use --max-regressions 0 to fail the gate on any example that newly fails relative to the baseline.
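
As a rough illustration of what a regression threshold gates on, the sketch below counts examples that passed in a baseline but fail in a candidate run, then compares the count to the allowed maximum. The file layout (one `example_id status` pair per line), the `count_regressions` function, and the sample data are assumptions for illustration, not AgentClash's payload format or implementation.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: count newly failing examples versus a baseline.
# Input files hold one "<example_id> <pass|fail>" pair per line (assumed layout).
set -euo pipefail

count_regressions() {
  local baseline_file="$1" candidate_file="$2"
  # The first file fills the baseline map; the second file is checked for
  # examples that passed in the baseline and fail in the candidate.
  awk '
    NR == FNR { baseline[$1] = $2; next }
    baseline[$1] == "pass" && $2 == "fail" { regressions++ }
    END { print regressions + 0 }
  ' "$baseline_file" "$candidate_file"
}

workdir="$(mktemp -d)"
printf '%s\n' 'ex-1 pass' 'ex-2 pass' 'ex-3 fail' > "$workdir/baseline.txt"
printf '%s\n' 'ex-1 pass' 'ex-2 fail' 'ex-3 fail' > "$workdir/candidate.txt"

max_regressions=0
regressions="$(count_regressions "$workdir/baseline.txt" "$workdir/candidate.txt")"
rm -rf "$workdir"

if [ "$regressions" -gt "$max_regressions" ]; then
  echo "gate failed: $regressions regression(s), allowed $max_regressions"
  gate_status=1
else
  echo "gate passed"
  gate_status=0
fi
```

Here ex-2 is the only example that moved from pass to fail, so a threshold of 0 fails the gate. ex-3 was already failing in the baseline and does not count as a regression. A real CI step would exit with `gate_status`.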

Info

Dataset gates are complementary to CI/CD agent gates and eval workflows. Use datasets when examples are curated and pinned; use eval scorecards when you are comparing agent revisions on a live pack.

See also