Question 1

What is DataSmith?

Accepted Answer

DataSmith is an open-source Python SDK for synthetic dataset generation using a weak-vs-strong agentic loop. A challenger proposes examples, weak and strong solvers attempt them, and a judge accepts only high-signal rows ready for fine-tuning or evaluation.
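
To make the loop concrete, here is a minimal sketch of one way such a filter could work. It is not DataSmith's actual API: the `Row` type, `run_round` function, and the acceptance rule (keep a row only when the strong solver is right and the weak solver is wrong) are assumptions made for illustration, and the solvers are stubs rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Row:
    """Hypothetical accepted example (not DataSmith's real schema)."""
    prompt: str
    reference: str
    weak_answer: str
    strong_answer: str


def exact_match(answer: str, reference: str) -> bool:
    """Toy judge: an answer passes only if it matches the reference exactly."""
    return answer.strip() == reference.strip()


def run_round(
    challenger: Callable[[], Tuple[str, str]],
    weak: Callable[[str], str],
    strong: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> Optional[Row]:
    """Run one challenger/solver/judge round and return a row if it is accepted."""
    prompt, reference = challenger()
    weak_answer = weak(prompt)
    strong_answer = strong(prompt)
    # Assumed acceptance rule: the example separates the two solvers.
    if judge(strong_answer, reference) and not judge(weak_answer, reference):
        return Row(prompt, reference, weak_answer, strong_answer)
    return None


# Stub components standing in for real models.
tasks = iter([("2 + 2", "4"), ("17 * 23", "391")])
strong_table = {"2 + 2": "4", "17 * 23": "391"}


def challenger() -> Tuple[str, str]:
    return next(tasks)


def weak(prompt: str) -> str:
    return "4"  # always guesses the same answer


def strong(prompt: str) -> str:
    return strong_table[prompt]


rows = []
for _ in range(2):
    row = run_round(challenger, weak, strong, exact_match)
    if row is not None:
        rows.append(row)
```

In this toy run the first task is rejected because both solvers get it right, and the second is kept because only the strong solver succeeds.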

Question 2

How is DataSmith related to AgentClash?

Accepted Answer

DataSmith handles offline dataset creation and training export (SFT, DPO, Hugging Face). AgentClash runs the same Agentic Self-Instruct loop in hosted workspaces, then scores, replays, and gates regressions on the generated examples.
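
As a rough illustration of what an SFT or DPO export could look like, the sketch below turns accepted rows into JSONL. The input row shape and the helper names `to_sft`, `to_dpo`, and `to_jsonl` are hypothetical. The output field names (`prompt`/`completion` and `prompt`/`chosen`/`rejected`) follow a common convention for these training formats and are not confirmed to be DataSmith's export schema. Using the strong answer as `chosen` and the weak answer as `rejected` is likewise an assumption.

```python
import json
from typing import Dict, Iterable


def to_sft(row: Dict[str, str]) -> Dict[str, str]:
    """Supervised fine-tuning record: prompt plus the preferred answer."""
    return {"prompt": row["prompt"], "completion": row["strong_answer"]}


def to_dpo(row: Dict[str, str]) -> Dict[str, str]:
    """Preference record: strong answer as chosen, weak answer as rejected."""
    return {
        "prompt": row["prompt"],
        "chosen": row["strong_answer"],
        "rejected": row["weak_answer"],
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical accepted rows from a weak-vs-strong run.
accepted = [{"prompt": "17 * 23", "weak_answer": "4", "strong_answer": "391"}]

sft_jsonl = to_jsonl(to_sft(r) for r in accepted)
dpo_jsonl = to_jsonl(to_dpo(r) for r in accepted)
```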

Question 3

What models does DataSmith support?

Accepted Answer

DataSmith works with any provider that implements its complete() protocol: OpenAI-compatible APIs, local inference, or your own wrapper. Weak and strong solvers can differ in model, prompt, or compute budget.
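
A custom wrapper might look like the sketch below. Only the method name `complete` comes from the answer above. The signature `complete(prompt: str) -> str`, the `Completer` protocol class, and the `EchoProvider` stub are assumptions for illustration, so check the SDK for the real interface.

```python
from typing import List, Protocol


class Completer(Protocol):
    """Assumed shape of a provider: one text-in, text-out method."""

    def complete(self, prompt: str) -> str:
        ...


class EchoProvider:
    """Stub provider that tags the prompt instead of calling a model."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def complete(self, prompt: str) -> str:
        return f"{self.prefix}{prompt}"


def solve(provider: Completer, prompt: str) -> str:
    """Any object with a matching complete() method satisfies the protocol."""
    return provider.complete(prompt)


# Weak and strong solvers are two providers behind the same interface.
weak = EchoProvider("weak: ")
strong = EchoProvider("strong: ")
answers: List[str] = [solve(p, "What is 2 + 2?") for p in (weak, strong)]
```

Because the protocol is structural, a wrapper does not need to inherit from anything; it only needs a `complete` method with a compatible signature.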

Question 4

Can I turn production traces into training data?

Accepted Answer

Yes. DataSmith ingests OpenTelemetry JSON and span JSONL as seeds. AgentClash also supports trace import into workspace datasets for eval-oriented workflows.
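
For a sense of what trace ingestion involves, the sketch below flattens an OpenTelemetry JSON export (the `resourceSpans` / `scopeSpans` / `spans` layout) into seed records. It does not use DataSmith's ingestion API. The function names, the seed fields, and the attribute keys `input` and `output` are hypothetical, so substitute whatever keys your instrumentation actually records.

```python
import json
from typing import Any, Dict, Iterator, List


def attr_map(attributes: List[Dict[str, Any]]) -> Dict[str, str]:
    """Collapse OTLP key/value attribute entries into a plain dict (strings only)."""
    out: Dict[str, str] = {}
    for item in attributes:
        value = item.get("value", {})
        if "stringValue" in value:
            out[item["key"]] = value["stringValue"]
    return out


def seeds_from_otlp(
    doc: Dict[str, Any],
    input_key: str = "input",    # hypothetical attribute key
    output_key: str = "output",  # hypothetical attribute key
) -> Iterator[Dict[str, Any]]:
    """Yield one seed per span that carries an input attribute."""
    for resource_spans in doc.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                attrs = attr_map(span.get("attributes", []))
                if input_key not in attrs:
                    continue  # skip spans with no usable prompt
                yield {
                    "source_span": span.get("name", ""),
                    "prompt": attrs[input_key],
                    "reference": attrs.get(output_key),
                }


# Made-up trace: one span with input/output, one without.
raw = """
{"resourceSpans": [{"scopeSpans": [{"spans": [
  {"name": "chat", "attributes": [
    {"key": "input", "value": {"stringValue": "Reset my password"}},
    {"key": "output", "value": {"stringValue": "Use the reset link."}}
  ]},
  {"name": "db.query", "attributes": []}
]}]}]}
"""
seeds = list(seeds_from_otlp(json.loads(raw)))
```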

High-signal synthetic data, not noisy bulk

Examples in the useful difficulty zone

From domain brief to trainer-ready export

Construct seeds

Run weak vs strong

Export for training

Eval and gate in AgentClash

Local SDK or hosted workspace generation
