AgentClash

Glossary

What is a challenge pack?

Challenge packs are AgentClash's unit of repeatable agent evaluation. Encode the task once so every model, prompt, or harness change reruns the same workload with the same constraints.

live race
gate: pass

Candidate: 92 (correct patch, low cost)

Baseline: 88 (stable reference run)

Control: 73 (missed edge case)

replay timeline

1. loaded task inputs and tool policy
2. ran sandbox actions and captured artifacts
3. scored trajectory and validator evidence
4. attached scorecard and release verdict

ci verdict

Candidate clears release gate

Correctness improved, latency within budget, and required artifacts were preserved for review.

agentclash run create --follow

What packs contain

Built for reviewable agent decisions

A pack bundles the inputs, tool policy, sandbox resources, validators, judges, artifacts, and pass conditions that together define a fair race or regression test.

Sandboxed real-tool execution

Head-to-head runs with fair constraints

Scorecards for correctness, cost, latency, and tool strategy

Replay trails for every important action

Challenge packs that turn failures into reusable tests

CI gates for baseline versus candidate decisions
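
To make the pack contents described above more concrete, here is a minimal Python sketch of how they might be modeled in memory. This is not AgentClash's actual schema: the class `ChallengePack`, every field name, the helper `missing_sections`, and the sample values are hypothetical, chosen only to mirror the items named on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChallengePack:
    """Hypothetical in-memory view of a pack; field names are illustrative only."""
    name: str
    version: str
    inputs: dict                 # task inputs handed to every agent
    tool_policy: tuple           # names of tools the agent may call
    sandbox: dict                # resource limits for the sandbox
    validators: tuple            # deterministic checks on the output
    judges: tuple                # rubric- or model-based graders (optional here)
    required_artifacts: tuple    # files that must be preserved for review
    pass_threshold: float        # minimum score that satisfies the pass condition


def missing_sections(pack: ChallengePack) -> list:
    """Return the names of required sections that are empty."""
    required = ("inputs", "tool_policy", "sandbox", "validators", "required_artifacts")
    return [name for name in required if not getattr(pack, name)]


# Made-up example values, for illustration only.
pack = ChallengePack(
    name="fix-failing-test",
    version="1.0.0",
    inputs={"issue": "test_parse fails on empty input"},
    tool_policy=("shell", "read_file", "write_file"),
    sandbox={"cpu": 2, "memory_mb": 4096, "timeout_s": 600},
    validators=("unit_tests_pass",),
    judges=(),
    required_artifacts=("patch.diff",),
    pass_threshold=80.0,
)

incomplete = replace(pack, validators=())  # same pack, but with no validators

print(missing_sections(pack))
print(missing_sections(incomplete))
```

Keeping the structure frozen reflects the idea that a pack is encoded once and then reused unchanged across runs; the real authoring format is described in the challenge pack reference docs.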

Workflow

From pack to gate

Package the task

Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.

Race the agents

Run every candidate against the same task with the same constraints.

Replay the evidence

Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.

Gate the release

Compare candidate and baseline runs, then fail CI before a regression reaches users.
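
The gating step above can be sketched as a small comparison function. This is an assumption-laden illustration, not the logic AgentClash ships: `Scorecard`, `gate`, the latency budget, and the latency values are hypothetical, while the scores 92, 88, and 73 are taken from the sample race shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical summary of one run; not AgentClash's actual schema."""
    score: float            # correctness score, higher is better
    latency_s: float        # wall-clock time for the run, in seconds
    artifacts: frozenset    # names of artifacts the run preserved


def gate(candidate, baseline, latency_budget_s, required_artifacts):
    """Return (passed, reasons) for a candidate compared with a baseline."""
    reasons = []
    if candidate.score < baseline.score:
        reasons.append("score regressed")
    if candidate.latency_s > latency_budget_s:
        reasons.append("latency over budget")
    missing = set(required_artifacts) - candidate.artifacts
    if missing:
        reasons.append("missing artifacts: " + ", ".join(sorted(missing)))
    return (not reasons, reasons)


# Latency figures below are made up for the example.
baseline = Scorecard(score=88, latency_s=41.0, artifacts=frozenset({"patch.diff"}))
candidate = Scorecard(score=92, latency_s=38.5, artifacts=frozenset({"patch.diff"}))
control = Scorecard(score=73, latency_s=35.0, artifacts=frozenset({"patch.diff"}))

passed, reasons = gate(candidate, baseline, 60.0, ["patch.diff"])
control_passed, control_reasons = gate(control, baseline, 60.0, ["patch.diff"])

print("gate: pass" if passed else "gate: fail", reasons)
print("gate: pass" if control_passed else "gate: fail", control_reasons)
```

In a CI job, a wrapper script would typically exit with a non-zero status when `passed` is false so the pipeline stops before the regression ships.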

FAQ

Challenge pack FAQ

How is a challenge pack versioned?

Packs are versioned YAML bundles in your workspace. Pin a version for benchmarks and CI so comparisons stay reproducible.
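
One way to picture why pinning matters: two runs are only worth comparing if they used the same pack at the same version. The sketch below assumes each run record carries hypothetical `pack` and `pack_version` keys; those names, the version strings, and the `comparable` helper are illustrative and not part of any documented AgentClash format.

```python
def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if they share a pack name and pinned version."""
    key_a = (run_a["pack"], run_a["pack_version"])
    key_b = (run_b["pack"], run_b["pack_version"])
    return key_a == key_b


# Made-up run records for illustration.
baseline_run = {"pack": "fix-failing-test", "pack_version": "1.0.0", "score": 88}
candidate_run = {"pack": "fix-failing-test", "pack_version": "1.0.0", "score": 92}
drifted_run = {"pack": "fix-failing-test", "pack_version": "1.1.0", "score": 95}

print(comparable(baseline_run, candidate_run))
print(comparable(baseline_run, drifted_run))
```

A score from a different pack version may reflect a changed task or changed scoring rules rather than a better agent, which is why the check runs before any score comparison.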

Can one pack power both benchmarks and CI?

Yes. The same frozen pack can back a public benchmark race and an internal release gate once your team trusts the scoring rules.

Where are examples?

See the example packs in the repository, and the challenge pack reference docs for field-by-field authoring guidance.