Dataset Workflows Skill

Use when managing AgentClash datasets via the CLI: creating versions, importing and exporting examples, running evals, gating CI, generating synthetic examples, importing traces, reviewing candidates, and syncing regression suites.

Canonical source: web/content/agent-skills/agentclash-dataset-workflows/SKILL.md

Markdown export: /docs-md/agent-skills/agentclash-dataset-workflows

Full SKILL.md

---
name: agentclash-dataset-workflows
description: Use when managing AgentClash datasets via CLI - create versions, import/export examples, run evals, CI gates, synthetic generation, trace import, candidate review, and regression suite sync.
metadata:
  agentclash.role: dataset
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Dataset Workflows

## Purpose
End-to-end dataset operations in a workspace: versioned example banks, eval runs against challenge packs, CI gating, synthetic generation, production trace import, and promotion into regression suites.

## Use When
- Building or curating labeled examples for prompt or agent evals.
- Gating merges on dataset eval pass rate against a baseline.
- Importing OTEL/Braintrust/LangSmith/Phoenix/AgentClash traces as reviewable candidates.
- Syncing a dataset version into a linked regression suite.

## Do Not Use When
- The user only needs a one-off challenge-pack run: use `agentclash-eval-runner`.
- The task is a prompt matrix experiment without a dataset artifact: use `agentclash-prompt-eval-playground`.
- The task is a harness coding task: use `agentclash-agent-harness-setup`.

## Inputs Needed
- Workspace ID and dataset ID (create datasets via API/UI if none exist).
- For eval/gate: dataset version ID, challenge pack version ID, challenge key, deployment IDs.
- For gate: baseline ID and candidate run ID (or `--eval` to start eval inline).
- For generate: `--count`, `--provider-account`, `--model-alias`.

## Environment
```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash workspace use <WORKSPACE_ID>
agentclash dataset list
agentclash dataset get <DATASET_ID> --json
```

## Procedure
1. Inspect dataset and versions (`list`, `get`, `versions list`).
2. Import or export examples; optionally create a version snapshot.
3. Run a dataset eval or attach an existing run.
4. Gate with `dataset test` against a baseline (CI-friendly `--format junit`).
5. Optionally generate synthetic examples, import traces, promote candidates, sync regression suite.

## Commands

### Inspect and mutate examples
```bash
agentclash dataset list
agentclash dataset get <dataset-id>
agentclash dataset versions list <dataset-id>
agentclash dataset versions create <dataset-id> --label "v2-seeds"
agentclash dataset import <dataset-id> examples.jsonl
agentclash dataset export <dataset-id> --version <version-id> -o out.jsonl
agentclash dataset examples list <dataset-id> --version <version-id>
agentclash dataset examples add <dataset-id> --input '{"messages":[...]}' --expected '{"score":1}'
agentclash dataset examples update <dataset-id> <example-id> --expected-file expected.json
agentclash dataset examples delete <dataset-id> <example-id>
```

### Eval and CI gate
```bash
agentclash dataset eval <dataset-id> \
  --version <version-id> \
  --pack <pack-version-id> \
  --challenge <challenge-key> \
  --deployment <deployment-id>

agentclash dataset test <dataset-id> \
  --baseline <baseline-id> \
  --run <run-id> \
  --min-pass-rate 0.9 \
  --max-regressions 0 \
  --format junit

# Start eval then gate in one command
agentclash dataset test <dataset-id> \
  --eval \
  --version <version-id> \
  --pack <pack-version-id> \
  --challenge <challenge-key> \
  --deployment <deployment-id> \
  --baseline <baseline-id> \
  --timeout 30m
```

### Synthetic generation
```bash
agentclash dataset generate <dataset-id> \
  --count 50 \
  --provider-account <account-id> \
  --model-alias <alias-id> \
  --create-version \
  --version-label "synthetic-v1" \
  --follow
```

### Trace import and promotion
```bash
agentclash dataset import-traces <dataset-id> traces.json --source otel
agentclash dataset import-traces <dataset-id> --source agentclash --run <run-id> --run-agent <run-agent-id>
agentclash dataset trace-candidates list <dataset-id> --status pending
agentclash dataset promote <dataset-id> <candidate-id> --tag production --expected-file edited.json
```

### Regression suite sync
```bash
agentclash dataset sync-regression-suite <dataset-id> \
  --version <version-id> \
  --pack <pack-version-id> \
  --challenge <challenge-key> \
  --suite-name "Dataset regression bank"
```

## Expected Output
- `dataset eval` creates a run; `dataset test` returns pass/fail with regression counts.
- `dataset test --format junit` exits 0 on pass and 1 on gate failure (422).
- `dataset generate` with `--follow` polls the job until completion.
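
The exit codes above can drive a CI step directly. The following is a minimal sketch, not part of the CLI: `gate_dataset` is a hypothetical helper, and it assumes the JUnit report is written to stdout, which this skill does not confirm.

```bash
# Hypothetical helper: run the gate, keep the JUnit report, branch on the exit code.
# Assumption: `dataset test --format junit` prints the report to stdout.
gate_dataset() {
  local dataset_id="$1" baseline_id="$2" run_id="$3"
  local report="${4:-dataset-gate.xml}"
  local status=0

  agentclash dataset test "$dataset_id" \
    --baseline "$baseline_id" \
    --run "$run_id" \
    --min-pass-rate 0.9 \
    --max-regressions 0 \
    --format junit > "$report" || status=$?

  case "$status" in
    0) echo "gate passed" ;;
    1) echo "gate failed" ;;
    *) echo "gate errored (exit $status)" ;;
  esac

  # Propagate the CLI exit code so the CI job fails when the gate fails.
  return "$status"
}
```

Calling `gate_dataset <dataset-id> <baseline-id> <run-id>` as the last command of a CI step lets the step's status follow the gate result, while the report file stays available for upload as a test artifact.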

## Failure Modes
- Gate without `--baseline` → `--baseline` is required.
- Gate without `--run` and without `--eval` → provide one of them.
- Generate without a provider account or model alias → `--count`, `--provider-account`, and `--model-alias` are all required.
- Sync regression suite without version, pack, or challenge → `--version`, `--pack`, and `--challenge` are all required.
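
The gate-related failure modes can be caught before the CLI is invoked. This is a sketch of a local preflight check; `check_gate_inputs` and its return code 2 are illustrative choices, not CLI behavior.

```bash
# Hypothetical preflight: mirror the gate rules listed above.
# Arguments: baseline ID, run ID (may be empty), "true" if --eval will be passed.
check_gate_inputs() {
  local baseline_id="$1" run_id="$2" use_eval="$3"

  if [ -z "$baseline_id" ]; then
    echo "gate needs --baseline" >&2
    return 2
  fi

  if [ -z "$run_id" ] && [ "$use_eval" != "true" ]; then
    echo "gate needs either --run or --eval" >&2
    return 2
  fi

  return 0
}
```

Running `check_gate_inputs "$BASELINE_ID" "$RUN_ID" "$USE_EVAL"` first (variable names are placeholders) surfaces a missing input as a clear local message.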

## Safety Notes
- Trace imports may contain production data; apply `--redaction` JSON when importing sensitive metadata.
- Baseline comparisons affect release gates; confirm the baseline ID before CI integration.
- Exported JSONL may include prompts with secrets; scrub it before sharing externally.
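
One way to scrub an export is a pattern-based pass over the file. The sketch below is illustrative only: `scrub_jsonl` is a hypothetical helper, and the two patterns (keys starting with `sk-` and `Bearer` tokens) are assumptions about what a secret looks like, so extend them to match the credentials your prompts may actually contain.

```bash
# Hypothetical helper: print a copy of a JSONL file with secret-looking values masked.
# The patterns are examples, not an exhaustive secret detector.
scrub_jsonl() {
  sed -E \
    -e 's/sk-[A-Za-z0-9_-]{16,}/[REDACTED]/g' \
    -e 's/(Bearer )[A-Za-z0-9._-]+/\1[REDACTED]/g' \
    "$1"
}
```

For example, after exporting to `out.jsonl`, `scrub_jsonl out.jsonl > out.scrubbed.jsonl` writes the masked copy to a separate file and leaves the original untouched. Review the result by hand before sharing, since pattern matching can miss secrets in unexpected formats.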

## Report Back Format
```text
Dataset: <id>
Version: <version-id or n/a>
Eval run: <run-id or n/a>
Gate: <pass/fail> - pass rate <x>, regressions <n>
Candidates: <pending count or n/a>
Regression suite: <suite-id or n/a>
Next: agentclash run scorecard <run-id>
```

## Related Skills
- `agentclash-hub`
- `agentclash-eval-runner`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`
- `agentclash-scorecard-reader`
- `agentclash-prompt-eval-playground`

## Related Docs
- `/docs-md/guides/datasets-overview`
- `/docs-md/guides/ci-cd-workload-recipes`
- `/docs-md/reference/cli`