Glossary
What is a challenge pack?
Challenge packs are AgentClash's unit of repeatable agent evaluation. Encode the task once so every model, prompt, or harness change reruns the same workload with the same constraints.
[Figure: candidate, baseline, and control runs shown with a replay timeline and a CI verdict reading "Correctness improved, latency within budget, and required artifacts were preserved for review."]
agentclash run create --follow
What packs contain
Built for reviewable agent decisions
A pack bundles the inputs, tool policy, sandbox resources, validators, judges, artifacts, and pass conditions that together define a fair race or regression test.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
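To make the pack contents described above concrete, here is a minimal Python sketch of a pack as an in-memory record. It is an illustration only, not AgentClash's actual schema: the class name ChallengePack, every field name, and the helper missing_sections are hypothetical, chosen to mirror the parts named above (inputs, tool policy, sandbox resources, validators, judges, artifacts, pass conditions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengePack:
    """Hypothetical view of a pack; field names are illustrative, not a real schema."""
    name: str
    version: str
    inputs: dict            # task inputs handed to every agent
    tool_policy: list       # tools the agent is allowed to call
    sandbox: dict           # resource limits for the run
    validators: list        # deterministic checks on outputs
    judges: list            # scoring rules that need judgment
    artifacts: list         # files that must be kept for review
    pass_conditions: dict   # thresholds that decide pass or fail


REQUIRED_SECTIONS = (
    "inputs", "tool_policy", "sandbox", "validators",
    "judges", "artifacts", "pass_conditions",
)


def missing_sections(pack: ChallengePack) -> list:
    """Return the names of empty sections so an incomplete pack is caught early."""
    return [name for name in REQUIRED_SECTIONS if not getattr(pack, name)]
```

A check like missing_sections could run before a pack is frozen, on the assumption that a pack with no validators or no pass conditions cannot define a fair comparison.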
Workflow
From pack to gate
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
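The final step, gating a release on a baseline-versus-candidate comparison, can be sketched as follows. This is a hedged illustration of the idea rather than AgentClash's gate logic: the Scorecard fields, the gate and exit_code functions, and the default 10% cost tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-run summary; a real scorecard may carry more dimensions."""
    correctness: float  # fraction of checks passed, 0.0 to 1.0
    latency_s: float    # wall-clock seconds for the run
    cost_usd: float     # total spend for the run


def gate(baseline, candidate, latency_budget_s, max_cost_increase=0.10):
    """Return a list of regression reasons; an empty list means the gate passes."""
    reasons = []
    if candidate.correctness < baseline.correctness:
        reasons.append("correctness dropped below baseline")
    if candidate.latency_s > latency_budget_s:
        reasons.append("latency exceeded budget")
    # Assumed tolerance: allow cost to grow by at most max_cost_increase.
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        reasons.append("cost grew beyond tolerance")
    return reasons


def exit_code(reasons):
    """Map gate reasons to a process exit code: 0 passes CI, 1 fails it."""
    return 1 if reasons else 0
```

In a CI job, a wrapper script could pass exit_code(gate(...)) to sys.exit so that any listed regression fails the pull request, with the reasons printed for reviewers.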
Authoring resources
Bring your first workload into the loop
Use the challenge pack docs and authoring guide to publish your first pack.
Agent evals
Real-task agent evals with replay evidence and CI gates.
LLM agent evaluation
Evaluate LLM agents on full trajectories, not one-shot answers.
Compare tools
See how AgentClash differs from prompt-eval platforms.
Challenge packs feature
Feature overview for pack-based eval.
Challenge pack docs
Reference hub for pack authors.
Benchmarks hub
Public races on frozen packs.
Write a challenge pack
Turn a real task into a repeatable agent evaluation.
CI/CD agent gates
Fail a pull request when an agent regresses.
FAQ
Challenge pack FAQ
How is a challenge pack versioned?
Packs are versioned YAML bundles in your workspace. Pin a version for benchmarks and CI so comparisons stay reproducible.
Can one pack power both benchmarks and CI?
Yes. The same frozen pack can back a public benchmark race and an internal release gate once your team trusts the scoring rules.
Where are examples?
See the example packs in the repository, and use the challenge pack reference docs for field-by-field authoring guidance.