Feature

Challenge packs for repeatable agent evaluation

Challenge packs are AgentClash's unit of agent evaluation: a real task, tool policy, scoring rules, and artifacts encoded once so every model or harness change reruns the same workload.


live eval

gate: pass

Candidate: 92 (correct patch, low cost)

Baseline: 88 (stable reference run)

Control: 73 (missed edge case)

replay timeline

1. loaded task inputs and tool policy

2. ran sandbox actions and captured artifacts

3. scored trajectory and validator evidence

4. attached scorecard and release verdict

ci verdict

Candidate clears release gate

Correctness improved, latency within budget, and required artifacts were preserved for review.

agentclash run create --follow

What challenge packs encode

Built for reviewable agent decisions

A challenge pack encodes inputs, sandbox resources, allowed tools, validators, judges, and pass conditions: everything needed for a fair, repeatable agent eval.

OpenTelemetry-compatible trace import

Pinned datasets and golden test cases

Baseline versus candidate regression checks

Replay trails for tool calls, outputs, and artifacts

Scorecards for correctness, cost, latency, and evidence

CI gates for prompt, model, RAG, and tool changes
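To make the list of encoded parts concrete, here is a minimal sketch of how a pack's contents could be modeled in memory. It is an illustration only: the class name, field names, tool names, and threshold are all hypothetical and are not AgentClash's actual pack schema, which is described in the challenge pack docs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChallengePack:
    """Hypothetical model of a pack; NOT AgentClash's real schema."""

    name: str
    version: str
    inputs: dict[str, str]        # task inputs handed to the agent
    allowed_tools: list[str]      # tool policy: anything else is denied
    validators: list[str]         # names of checks run on the result
    pass_threshold: float         # minimum score (0-100) needed to pass
    required_artifacts: list[str] = field(default_factory=list)

    def permits(self, tool: str) -> bool:
        """Tool policy check: only explicitly allowed tools may be called."""
        return tool in self.allowed_tools

    def passes(self, score: float, artifacts: set[str]) -> bool:
        """Pass condition: score meets the threshold and every required
        artifact was captured."""
        return (
            score >= self.pass_threshold
            and set(self.required_artifacts) <= artifacts
        )


# Illustrative values only.
pack = ChallengePack(
    name="fix-failing-test",
    version="1",
    inputs={"issue": "parser fails on empty input"},
    allowed_tools=["read_file", "write_file", "run_tests"],
    validators=["tests_pass"],
    pass_threshold=85.0,
    required_artifacts=["patch.diff"],
)
```

Keeping the tool policy and pass condition on the same object as the inputs is what lets every rerun use an identical workload; a real pack would express the same ideas in its own format.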

Workflow

Pack lifecycle

Import the evidence

Start from OpenTelemetry traces, curated datasets, support transcripts, or a real failure your team already saw.

Pin the baseline

Record the current accepted behavior so every prompt, model, RAG, or tool change has a fair comparison point.

Replay the evidence

Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence when a candidate gets worse.

Gate the release

Compare candidate and baseline runs, then fail CI before a regression reaches users.
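The gating step can be sketched as a plain comparison between a pinned baseline run and a candidate run. This is a hedged illustration, not AgentClash's implementation: the function name, the `score` and `latency_s` keys, the tolerance parameters, and the latency figures are assumptions made for the example.

```python
def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_score_drop: float = 0.0,
    max_latency_ratio: float = 1.1,
) -> list[str]:
    """Return the reasons a candidate regressed; an empty list means pass.

    Hypothetical sketch: keys and tolerances are illustrative assumptions.
    """
    failures: list[str] = []
    # Correctness: the candidate may not score below the baseline
    # by more than the allowed drop.
    if candidate["score"] < baseline["score"] - max_score_drop:
        failures.append(
            f"score regressed: {candidate['score']} < {baseline['score']}"
        )
    # Latency: the candidate must stay within a budget relative to baseline.
    if candidate["latency_s"] > baseline["latency_s"] * max_latency_ratio:
        failures.append(
            f"latency over budget: {candidate['latency_s']}s "
            f"vs baseline {baseline['latency_s']}s"
        )
    return failures


# Illustrative runs; latency values are made up for the example.
baseline_run = {"score": 88.0, "latency_s": 40.0}
candidate_run = {"score": 92.0, "latency_s": 42.0}
regressed_run = {"score": 73.0, "latency_s": 42.0}

ok = gate(baseline_run, candidate_run)
bad = gate(baseline_run, regressed_run)
```

In a CI job, a wrapper script could exit non-zero whenever the returned list is non-empty, which is what fails the build before the regression ships.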

Author packs with docs

Bring your first workload into the loop

Read the challenge pack docs and authoring guide, then promote escaped failures into packs your whole team can run.

Agent evals

Real-task agent evals with replay evidence and CI gates.

LLM agent evaluation

Evaluate LLM agents on full trajectories, not one-shot answers.

Compare tools

See how AgentClash differs from prompt-eval platforms.

Challenge packs docs

Overview of pack structure, scoring, and execution modes.

Write a challenge pack

Step-by-step authoring guide for your first pack.

Datasets overview

Import examples, record baselines, sync regression suites, and gate CI.

Dataset CI gates

Fail builds when a candidate regresses against a pinned baseline.

FAQ

Challenge packs FAQ

What is a challenge pack?

A challenge pack is a versioned agent evaluation workload with inputs, tool policy, scoring rules, and expected artifacts.

Can challenge packs run locally and in CI?

Yes. The same pack can power exploratory evals, hosted runs, and pull request gates.

How do teams create challenge packs?

Start from a real failure or release risk, encode it as YAML, and iterate with replay evidence until the scoring rules match what reviewers expect.
