Synthetic dataset generation

Generate high-signal synthetic examples with Self-Instruct or Agentic Self-Instruct weak-vs-strong loops in AgentClash workspaces.

AgentClash can expand pinned datasets with synthetic examples generated from existing seeds. Two strategies are available:

  • Fast Self-Instruct — prompt-only generation for quick volume when you already trust your seeds.
  • Agentic Self-Instruct — weak-vs-strong solver rollouts with a judge that accepts only examples in the useful difficulty zone.

The Agentic Self-Instruct loop follows the pattern described in Meta FAIR Autodata. For offline training export (SFT, DPO, Hugging Face), use the open-source DataSmith Python SDK.

When to use each strategy

Strategy               Best for                                Tradeoff
Fast Self-Instruct     Rapid expansion of well-curated seeds   Less quality filtering
Agentic Self-Instruct  Targeted eval and regression datasets   Higher token cost per accepted row

Start with Agentic Self-Instruct when examples will gate releases or train downstream models. Use Fast Self-Instruct for exploratory coverage.
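To make the token-cost tradeoff concrete: the cost of one accepted row is the spend per candidate divided by the fraction of candidates that are kept. The sketch below is illustrative arithmetic only. The function name, the token counts, and the acceptance rates are made-up inputs, not AgentClash measurements.

```python
def tokens_per_accepted_row(tokens_per_candidate, acceptance_rate):
    """Expected tokens spent for each row that ends up accepted."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return tokens_per_candidate / acceptance_rate


# Hypothetical inputs: prompt-only generation is assumed to keep every
# candidate, while the agentic loop is assumed to spend more tokens per
# candidate (solver rollouts plus judging) and keep only a quarter of them.
fast = tokens_per_accepted_row(1_000, 1.0)
agentic = tokens_per_accepted_row(8_000, 0.25)
```

Plugging in your own per-candidate spend and observed acceptance rate gives a rough budget for a target number of accepted rows.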

Prerequisites

  • A workspace dataset with at least one seed example (or import traces first).
  • Provider API keys or BYOK deployments configured in the workspace.
  • For Agentic Self-Instruct: weak and strong solver models (or deployments) selected in the generation dialog.

Start generation from the web UI

  1. Open Workspaces → Datasets → [your dataset].
  2. Click Synthetic generation.
  3. Choose Fast Self-Instruct or Agentic Self-Instruct.
  4. Pin a target dataset version label for accepted rows.
  5. For Agentic Self-Instruct, configure acceptance thresholds (strong pass rate, weak fail rate, minimum score gap).
  6. Start the job and monitor progress. Rejected rows include reason codes for review.

Accepted examples are written with source=synthetic and appear in the dataset version you specified.

Start generation from the CLI

bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"

agentclash dataset generate <datasetId> \
  --strategy agentic_self_instruct \
  --version-label "synthetic-v1" \
  --max-examples 50

Poll job status:

bash
agentclash dataset generate status <datasetId> <jobId>

See agentclash dataset generate --help for weak/strong deployment flags and acceptance tuning.
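Polling can be scripted around the status command. The sketch below is a generic poll-with-timeout loop and assumes nothing about the CLI's output format: `fetch_status` is a hypothetical callable you would supply (for example, one that runs `agentclash dataset generate status` and parses whatever it prints), and the terminal state names are placeholders rather than documented AgentClash values.

```python
import time


def poll_until_done(fetch_status, terminal_states, interval_s=5.0,
                    timeout_s=600.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal state or time runs out.

    fetch_status is a hypothetical, caller-supplied function returning the
    current job state as a string. sleep and clock are injectable for tests.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"job still in state {status!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

A fixed interval keeps the example short; a long-running job may be better served by a longer interval or a backoff.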

Agentic Self-Instruct acceptance policy

The judge accepts a candidate when:

  • The strong solver meets your configured pass threshold.
  • The weak solver falls below its pass threshold (the example is learnable).
  • The score gap between strong and weak exceeds your minimum gap.

The job summary includes avg_gap across accepted rows so you can compare runs over time.
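The three acceptance conditions can be sketched as a small predicate. This is a minimal illustration, not AgentClash's implementation: the `Thresholds` fields, the 0-to-1 score scale, the per-rollout `pass_score`, and the reading of "pass threshold" as a pass rate over several rollouts are all assumptions. Only the name `avg_gap` comes from the job summary.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical field names; the real settings live in the generation dialog.
    strong_pass: float       # strong solver must reach at least this pass rate
    weak_pass: float         # weak solver must stay below this pass rate
    min_gap: float           # strong-minus-weak mean score must exceed this
    pass_score: float = 0.5  # assumed score at which one rollout counts as a pass


def pass_rate(scores, pass_score):
    """Fraction of rollout scores at or above pass_score."""
    return sum(s >= pass_score for s in scores) / len(scores)


def score_gap(strong_scores, weak_scores):
    """Mean strong score minus mean weak score for one candidate."""
    return mean(strong_scores) - mean(weak_scores)


def accept(strong_scores, weak_scores, t):
    """True only when all three conditions hold for the candidate."""
    return (
        pass_rate(strong_scores, t.pass_score) >= t.strong_pass
        and pass_rate(weak_scores, t.pass_score) < t.weak_pass
        and score_gap(strong_scores, weak_scores) > t.min_gap
    )


def avg_gap(accepted):
    """Mean score gap over accepted (strong_scores, weak_scores) pairs."""
    return mean(score_gap(s, w) for s, w in accepted)
```

Under these assumptions, a candidate that both solvers pass is rejected as too easy, and one that the strong solver fails is rejected as too hard or ill-posed.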

After generation: eval and gates

Synthetic rows are most valuable when they feed regression coverage:

  1. Run a dataset eval against a challenge pack — see Datasets overview.
  2. Record a baseline from a green eval run.
  3. Gate CI with agentclash dataset test — see Dataset CI gates.
  4. Sync into a regression suite so escaped failures stay covered.

DataSmith for training export

AgentClash optimizes for eval-ready datasets inside your workspace. When you need ShareGPT, DPO pairs, or Hugging Face Hub export for fine-tuning, use DataSmith:

bash
pip install datasmith
datasmith construct-seeds --brief "your domain" --output seeds.jsonl
datasmith run --seeds seeds.jsonl --output-dir ./artifacts
datasmith export --input ./artifacts/accepted.jsonl --format dpo --output pairs.jsonl

DataSmith also ingests OpenTelemetry traces locally, whereas AgentClash trace import produces eval candidates in the hosted product.

See also