Synthetic data for agents
High-signal synthetic data, not noisy bulk
DataSmith generates training and eval datasets with a weak-vs-strong agentic loop inspired by Meta FAIR Autodata. Use the open-source Python SDK locally, or run Agentic Self-Instruct inside AgentClash for replay, scoring, and CI gates.
acceptance policy
pass >= 0.85
pass <= 0.55
gap >= 0.20
grounded + rubric
export targets
pip install datasmith datasmith run --seeds seeds.jsonl
Why weak vs strong
Examples in the useful difficulty zone
Prompt-only self-instruct often produces tasks that are too easy or too hard. The weak-vs-strong loop accepts rows only when a strong solver succeeds and a weak solver struggles: the zone where fine-tuning actually teaches something.
Web-grounded seed construction from domain briefs
Weak-vs-strong Agentic Self-Instruct judge loop
OpenTelemetry trace ingestion for seed material
Accepted and rejected JSONL with reason codes
ShareGPT, ChatML, DPO, and prompt-completion export
Provider-agnostic model interface (OpenAI-compatible or custom)
Hosted Agentic Self-Instruct inside AgentClash workspaces
Dataset baselines, eval runs, and CI regression gates
Workflow
From domain brief to trainer-ready export
Construct seeds
Start from a domain brief, web-grounded search, production traces, or source documents.
Run weak vs strong
Challenger proposes examples. Weak and strong solvers attempt them. Judge filters for the useful difficulty zone.
Export for training
Ship ShareGPT, ChatML, DPO pairs, or push accepted rows to Hugging Face Hub.
Eval and gate in AgentClash
Run hosted Agentic Self-Instruct generation, baseline the dataset, and wire CI regression gates.
Start here
Local SDK or hosted workspace generation
Pick DataSmith when you need offline SFT and DPO export. Pick AgentClash when the same examples should baseline evals and block regressions in CI.
Introducing DataSmith
Launch blog: weak-vs-strong loop, trace ingestion, and export formats.
Synthetic dataset generation guide
Run Agentic Self-Instruct inside AgentClash workspaces.
Agentic Self-Instruct SEO hub
Keyword landing for the Autodata-inspired generation pattern.
Datasets overview
Baselines, eval runs, and regression suite sync after generation.
FAQ
Questions teams ask about synthetic data generation
What is DataSmith?
DataSmith is an open-source Python SDK for synthetic dataset generation using a weak-vs-strong agentic loop. A challenger proposes examples, weak and strong solvers attempt them, and a judge accepts only high-signal rows ready for fine-tuning or evaluation.
How is DataSmith related to AgentClash?
DataSmith handles offline dataset creation and training export (SFT, DPO, Hugging Face). AgentClash runs the same Agentic Self-Instruct loop in hosted workspaces, then scores, replays, and gates regressions on the generated examples.
What models does DataSmith support?
Any provider that implements DataSmith's complete() protocol: OpenAI-compatible APIs, local inference, or your own wrapper. Weak and strong solvers can be different models, prompts, or compute budgets.
Can I turn production traces into training data?
Yes. DataSmith ingests OpenTelemetry JSON and span JSONL as seeds. AgentClash also supports trace import into workspace datasets for eval-oriented workflows.