AgentClash

Agent testing

AI agent testing that looks like release engineering

AI agent testing should feel closer to software testing than prompt tweaking. AgentClash turns real failures into repeatable challenge packs, compares candidates to baselines, and keeps the evidence reviewers need.

live race
gate: pass

Candidate

92: correct patch, low cost

Baseline

88: stable reference run

Control

73: missed edge case

replay timeline

1. Loaded task inputs and tool policy
2. Ran sandbox actions and captured artifacts
3. Scored trajectory and validator evidence
4. Attached scorecard and release verdict

ci verdict

Candidate clears release gate

Correctness improved, latency within budget, and required artifacts were preserved for review.

agentclash run create --follow

What production-grade agent testing covers

Built for reviewable agent decisions

Testing agents means checking behavior across the whole run, not approving a single polished answer from a cherry-picked prompt.

Sandboxed real-tool execution

Head-to-head runs with fair constraints

Scorecards for correctness, cost, latency, and tool strategy

Replay trails for every important action

Challenge packs that turn failures into reusable tests

CI gates for baseline versus candidate decisions

Workflow

Test loop

Package the task

Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.

Race the agents

Run every candidate against the same task with the same constraints.

Replay the evidence

Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.

Gate the release

Compare candidate and baseline runs, and fail CI when the candidate regresses so the change never reaches users.
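To make the gate step concrete, here is a minimal Python sketch of a baseline-versus-candidate decision. It is an illustration only, not AgentClash's implementation: the `Scorecard` fields, the `gate` function, the latency budget, and the cost tolerance are all hypothetical names and values. The correctness scores mirror the example race above (candidate 92, baseline 88); the latency and cost figures are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical summary of one run; not AgentClash's actual schema."""
    name: str
    correctness: float  # 0-100, higher is better
    latency_s: float    # wall-clock seconds for the run (placeholder values below)
    cost_usd: float     # total spend for the run (placeholder values below)


def gate(candidate: Scorecard, baseline: Scorecard,
         latency_budget_s: float, max_cost_increase: float = 0.10) -> list[str]:
    """Return the reasons the candidate fails the gate; an empty list means pass."""
    failures = []
    if candidate.correctness < baseline.correctness:
        failures.append("correctness regressed")
    if candidate.latency_s > latency_budget_s:
        failures.append("latency over budget")
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        failures.append("cost increased beyond tolerance")
    return failures


baseline = Scorecard("baseline", correctness=88, latency_s=41.0, cost_usd=0.30)
candidate = Scorecard("candidate", correctness=92, latency_s=38.5, cost_usd=0.27)

failures = gate(candidate, baseline, latency_budget_s=60.0)
print("gate: pass" if not failures else "gate: fail (" + ", ".join(failures) + ")")
```

In a CI job, a non-empty failure list would typically map to a non-zero exit code so the pipeline stops. Returning reasons rather than a bare boolean keeps the verdict reviewable.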

Start testing with docs

Bring your first workload into the loop

Write a challenge pack, run a race, inspect replay, then wire the same workload into CI.

FAQ

AI agent testing FAQ

Is AI agent testing different from LLM testing?

Yes. LLM testing often means scoring one response. AI agent testing evaluates plans, tool calls, artifacts, recovery behavior, and whether the task actually finished.
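The difference can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Run` and `ToolCall` types, the field names, and the recovery rule (every failed tool call is later followed by a successful call to the same tool) are hypothetical and not taken from AgentClash.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    ok: bool  # whether the call succeeded


@dataclass
class Run:
    """Hypothetical record of one agent run."""
    final_answer: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    artifacts: set[str] = field(default_factory=set)
    finished: bool = False


def check_response(run: Run, expected: str) -> bool:
    """Response-level check: looks only at the final text."""
    return expected in run.final_answer


def recovered(calls: list[ToolCall]) -> bool:
    """True if every failed call is followed by a successful call to the same tool."""
    return all(
        any(later.ok and later.tool == call.tool for later in calls[i + 1:])
        for i, call in enumerate(calls)
        if not call.ok
    )


def check_run(run: Run, expected: str, allowed_tools: set[str],
              required_artifacts: set[str]) -> dict[str, bool]:
    """Run-level check: answer, tool policy, artifacts, recovery, and completion."""
    return {
        "answer": check_response(run, expected),
        "tools_allowed": all(c.tool in allowed_tools for c in run.tool_calls),
        "artifacts_present": required_artifacts <= run.artifacts,
        "recovered": recovered(run.tool_calls),
        "finished": run.finished,
    }


# A polished answer that hides a policy violation and a missing artifact.
run = Run(
    final_answer="All tests pass",
    tool_calls=[ToolCall("shell", True), ToolCall("http", True)],
    artifacts={"patch.diff"},
    finished=False,
)
response_ok = check_response(run, "tests pass")
run_report = check_run(run, "tests pass",
                       allowed_tools={"shell"},
                       required_artifacts={"patch.diff", "test.log"})
print(response_ok, run_report)
```

The response-level check passes here, while the run-level report flags the disallowed tool, the missing artifact, and the unfinished task.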

Can non-ML engineers review agent test failures?

Yes. Replay timelines, artifacts, and scorecards are designed for reviewers who need to understand what changed without relying on raw model traces alone.

How do we keep tests from getting stale?

Promote escaped production failures into challenge packs and regression suites so the same mistake stays covered after the next model swap.
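As a rough sketch of that promotion step, the Python below turns a production incident into a pack and adds it to a regression suite. Everything here is an assumption made for illustration: the `ChallengePack` fields follow the four parts named in the workflow (inputs, tools, scoring rules, artifacts), but the type, the function names, the incident ID, and the example values are hypothetical and do not describe AgentClash's actual pack format.

```python
from dataclasses import dataclass


@dataclass
class ChallengePack:
    """Hypothetical pack shape: inputs, tools, scoring rules, artifacts."""
    name: str
    inputs: dict
    tools: tuple[str, ...]
    scoring_rules: tuple[str, ...]
    required_artifacts: tuple[str, ...]


def promote_failure(incident_id: str, inputs: dict, tools: tuple[str, ...],
                    expected_behavior: str) -> ChallengePack:
    """Build a regression pack from the inputs that triggered a production failure."""
    return ChallengePack(
        name=f"regression-{incident_id}",
        inputs=inputs,
        tools=tools,
        scoring_rules=(expected_behavior,),
        required_artifacts=("replay",),
    )


def add_to_suite(suite: dict[str, ChallengePack], pack: ChallengePack) -> None:
    """Add the pack once; promoting the same incident again does not duplicate it."""
    suite.setdefault(pack.name, pack)


suite: dict[str, ChallengePack] = {}
pack = promote_failure(
    incident_id="inc-001",  # placeholder incident ID
    inputs={"task": "fix failing date parser", "sample": ""},
    tools=("shell", "git"),
    expected_behavior="handles an empty date string without crashing",
)
add_to_suite(suite, pack)
add_to_suite(suite, pack)  # second promotion is a no-op
print(sorted(suite))
```

Keying the suite by incident keeps the suite tied to real failures, so a model swap is tested against the mistakes that already reached production.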