2026-07-01 · Atharva
Introducing DataSmith: High-Signal Synthetic Data for AI Agents
Most self-instruct pipelines produce volume, not signal. You get thousands of prompt-completion pairs where the weak model already succeeds or the task is impossible to learn from. That noise shows up later as flat fine-tunes and eval sets that do not match production.
DataSmith fixes the loop. It implements the practical weak-vs-strong Agentic Self-Instruct pattern from Meta FAIR's Autodata work: a challenger proposes examples, weak and strong solvers attempt them, and a judge accepts only when the gap is useful for training.
The problem with prompt-only synthetic data
Classic self-instruct assumes you already have seed examples and that bulk generation is good enough. In practice:
- Seeds are thin or off-domain without web grounding.
- Generated tasks are too easy (weak model passes) or too hard (strong model fails).
- Rejected examples disappear instead of feeding the next attempt.
- Export stops at JSONL instead of trainer-ready SFT, DPO, or Hugging Face formats.
DataSmith treats data generation as a repeatable pipeline, not a one-shot prompt.
Two stages: seeds, then generation
Stage 1: Seed construction
Start from a domain brief. DataSmith's seed constructor can use web-search signals to bootstrap grounded seed examples. A seed judge filters for specificity and usefulness before anything enters the generation loop.
Stage 2: Weak-vs-strong generation
For each seed, a challenger proposes a candidate example. Weak and strong solvers attempt it. A judge scores quality and separation. Accepted rows land in accepted.jsonl; rejections keep reason codes, solver attempts, and feedback for the next round.
pip install datasmith
datasmith construct-seeds --brief "customer support refund policy" --output seeds.jsonl
datasmith run --seeds seeds.jsonl --output-dir ./artifacts
datasmith export --input ./artifacts/accepted.jsonl --format sharegpt --output train.jsonl
The Python import is asi for compatibility. The CLI answers to both datasmith and asi.
Provider-agnostic by design
DataSmith works with any model that implements a simple complete() protocol:
- OpenAI-compatible APIs (OpenAI, Groq, Together, local vLLM)
- Your own wrapper for Anthropic, Gemini, or internal gateways
- Deterministic demo models for tests and tutorials
No vendor lock-in. Swap weak and strong solvers per domain without rewriting the loop.
Turn production traces into training seeds
Real agent runs are often the best seed material. DataSmith ingests OpenTelemetry JSON and flattened span JSONL:
datasmith ingest-otel --input traces.jsonl --output seeds.jsonl
That closes the loop between observability and fine-tuning: production failures become curated training examples instead of dashboard noise.
Export for SFT, DPO, and Hugging Face
Accepted artifacts export to formats trainers actually consume:
- ShareGPT and ChatML for supervised fine-tuning
- DPO preference pairs when weak and strong outputs diverge
- Prompt-completion JSONL for simpler pipelines
- Direct push to Hugging Face Hub with the optional
[hf]extra
See the GSM8K + Qwen DPO benchmark writeup for a reproducible end-to-end example.
DataSmith inside AgentClash
DataSmith is the local, training-oriented layer. AgentClash is the hosted eval and regression layer.
Inside AgentClash workspaces you can run the same weak-vs-strong loop as Agentic Self-Instruct generation on pinned datasets. Accepted synthetic rows become eval examples you can baseline, gate in CI, and promote into regression suites.
| Capability | DataSmith (Python SDK) | AgentClash (hosted) |
|---|---|---|
| Web-grounded seed construction | Yes | Seeds from existing examples |
| Weak-vs-strong judge loop | Yes | Yes (agentic_self_instruct) |
| OTLP trace ingestion | Yes | Trace import to candidates |
| SFT / DPO / HF export | Yes | Eval and regression formats |
| CI regression gates | Manual | Built-in dataset gates |
Use DataSmith when you need offline dataset creation for fine-tunes. Use AgentClash when you need replay, scoring, and release gates on the same workloads.
What DataSmith is not
DataSmith is the practical substrate, not Meta's full meta-optimization stack. It does not run outer-loop prompt evolution or RL training. It gives you the inner loop: ingestion, orchestration, acceptance policies, artifacts, CLI, and tests you can plug into whatever you are building.
MIT licensed. Alpha (0.1.0). Feedback and contributions welcome on GitHub.
Get started
- Local SDK:
pip install datasmithand follow the README quickstart. - Hosted generation: Sign in at agentclash.dev, open a dataset, and start Agentic Self-Instruct generation from the workspace UI.
- Read the platform overview: /platform/datasmith for the full AgentClash + DataSmith story.
Tell us what domain you want to generate data for first. That feedback shapes the next seed-constructor templates and export presets.
Explore