Datasets overview
Run pinned dataset evals, record baselines, sync regression suites, and gate releases with agentclash dataset commands.
Dataset evals run a pinned version of labeled examples through a challenge pack and deployment, then compare candidate runs against recorded baselines. They complement ad-hoc eval runs when you need reproducible regression signal tied to curated examples.
Lifecycle
- Create or import a dataset in the workspace (web UI or API).
- Pin a version with examples linked to challenge cases.
- Run a dataset eval — AgentClash executes each example and aggregates pass/fail.
- Record a baseline from a green eval run.
- Gate CI with agentclash dataset test (see Dataset CI Gates).
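The example that follows exports three environment variables before calling the CLI. In CI, where such values usually come from secrets, a preflight check can fail fast when one is missing. The sketch below is a hypothetical helper, not part of the agentclash CLI; only the variable names come from this guide.

```shell
# Hypothetical helper (not shipped with agentclash): report every
# named environment variable that is unset or empty.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name, POSIX-compatible.
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  # Exit status 0 when all are set, 1 when at least one is missing.
  return "$missing"
}

# Usage in a CI step:
#   require_env AGENTCLASH_API_URL AGENTCLASH_TOKEN AGENTCLASH_WORKSPACE || exit 1
```

Reporting all missing names in one pass, rather than stopping at the first, saves a round trip when several secrets are misconfigured.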
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"

# Start an eval against a pinned version
agentclash dataset eval <datasetId> \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --deployment <deployment-id>

Baselines and regression suites
Baselines snapshot per-example outcomes from a completed eval. Syncing a pinned version into a regression suite copies provenance (dataset_example_id, trace metadata) into workspace_regression_cases so manifest-based release gates can reference the same examples.
agentclash dataset sync-regression-suite <datasetId> \
  --version <dataset-version-id> \
  --pack <challenge-pack-version-id> \
  --challenge support \
  --suite-name "Support dataset regression"

Re-running sync is idempotent: existing cases keyed by dataset_example_id are skipped.
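To illustrate what skipping by key means, here is a minimal local model of an idempotent sync. It is only a sketch: the suite is a hypothetical text file with one dataset_example_id per line, which is an assumption made for illustration and not how AgentClash stores workspace_regression_cases.

```shell
# Illustrative model only. The "suite" is a plain text file with one
# dataset_example_id per line (a hypothetical stand-in for real storage).
sync_examples() {
  suite_file="$1"
  shift
  touch "$suite_file"
  added=0
  for example_id in "$@"; do
    # -F fixed string, -x whole-line match, -q quiet: the key already exists.
    if grep -Fxq -- "$example_id" "$suite_file"; then
      continue
    fi
    printf '%s\n' "$example_id" >> "$suite_file"
    added=$((added + 1))
  done
  # Print how many cases this run added.
  echo "$added"
}
```

In this model, running the function twice with the same ids adds nothing the second time, and a later run that includes one new id adds only that id.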
CI gate output formats
agentclash dataset test can emit human-readable text, JSON gate payloads, or JUnit XML for CI parsers. Use --max-regressions 0 to fail the gate on any example that is newly failing relative to the baseline.
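The threshold logic can be pictured with a small local sketch. It assumes a hypothetical input format of one "<example_id> <pass|fail>" pair per line, which is not the actual gate payload, and it assumes a regression means an example that passed in the baseline and fails in the candidate. Examples absent from the baseline are not counted here; the real command may treat them differently.

```shell
# Illustrative model only. Input files are hypothetical: each line is
# "<example_id> <pass|fail>". This is not the agentclash payload format.
count_regressions() {
  baseline_file="$1"
  candidate_file="$2"
  awk '
    # First file: remember each baseline outcome by example id.
    FILENAME == ARGV[1] { baseline[$1] = $2; next }
    # Second file: count only pass -> fail transitions.
    $2 == "fail" && baseline[$1] == "pass" { n++ }
    END { print n + 0 }
  ' "$baseline_file" "$candidate_file"
}

# Succeeds (exit 0) when regressions stay within the allowed maximum.
gate() {
  max_regressions="$1"
  regressions="$(count_regressions "$2" "$3")"
  echo "regressions=$regressions max=$max_regressions"
  [ "$regressions" -le "$max_regressions" ]
}
```

With a maximum of 0, a single pass-to-fail transition makes the gate return a non-zero status, which is the behavior a CI step would key on. An example that was already failing in the baseline does not count against the candidate in this model.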
Info
Dataset gates are complementary to CI/CD agent gates and eval workflows. Use datasets when examples are curated and pinned; use eval scorecards when you are comparing agent revisions on a live pack.
See also
- Dataset CI Gates — GitHub Actions recipe and API surface
- CI/CD workload recipes — pick realistic agent workloads
- Interpret results — read scorecards and timelines