AgentClash

Release gates for agents

AI agent regression testing for CI

Agent changes can pass a demo and still get worse in production. AgentClash reruns real tasks, compares candidates against a baseline, and blocks pull requests when scorecards or evidence regress.

Pull request gate: blocked

Regression scorecard
  Correctness   -9        (threshold -3)
  Latency       +18%      (threshold +15%)
  Cost          +2%       (threshold +10%)
  Artifacts     missing   (required)

Replay evidence
  1. candidate skipped validator retry after tool timeout
  2. baseline produced required patch and attached artifact
  3. candidate final answer passed prose check but failed file check

agentclash compare gate --baseline run-stable --candidate run-pr-184

What a gate checks

Regression signals that survive model churn

Prompts, models, tools, and sandbox images all move. AgentClash makes the workload repeatable so the release decision is based on behavior, not a one-off transcript.

Baseline versus candidate scorecards

Replay timelines for every failed gate

Artifact checks for files, logs, and evidence

Cost and latency thresholds for production budgets

Challenge packs that make failures repeatable

Pull request gates for model, prompt, and tool changes

Workflow

From escaped failure to pull request gate

Freeze the workload

Turn a real failure or release risk into a challenge pack with inputs, tools, artifacts, and scoring rules; a hypothetical pack is sketched after these steps.

Compare candidate to baseline

Run both agents under the same constraints and compare correctness, latency, cost, and evidence.

Block risky changes

Fail the pull request when the scorecard regresses past the release gate threshold.

Promote failures

Convert escaped failures into reusable regression cases so the same mistake stays covered.
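To make the workflow concrete, here is a minimal sketch of what a frozen challenge pack and its gate thresholds could look like. The ChallengePack and GateThresholds structures, their field names, and the example values are assumptions for illustration only; this page does not define AgentClash's actual pack format.

# Hypothetical sketch of a frozen challenge pack. The structures and field
# names below are illustrative assumptions, not AgentClash's real schema.
from dataclasses import dataclass, field


@dataclass
class GateThresholds:
    correctness_delta: float = -3.0   # block if correctness drops more than 3 points
    latency_delta_pct: float = 15.0   # block if latency grows more than 15%
    cost_delta_pct: float = 10.0      # block if cost grows more than 10%
    require_artifacts: bool = True    # block if a required artifact is missing


@dataclass
class ChallengePack:
    name: str
    inputs: list[str]              # task prompts or fixtures replayed on every run
    tools: list[str]               # tools the agent is allowed to call
    required_artifacts: list[str]  # files or evidence the run must produce
    thresholds: GateThresholds = field(default_factory=GateThresholds)


# Example pack frozen from an escaped failure; the values are placeholders.
refund_pack = ChallengePack(
    name="refund-escalation-regression",
    inputs=["tickets/refund-escalation-0142.json"],
    tools=["crm.lookup", "payments.refund"],
    required_artifacts=["refund-report.md"],
)

The thresholds mirror the example gate above: a correctness drop past 3 points, latency growth past 15%, cost growth past 10%, or a missing required artifact is enough to block the pull request.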

Start with docs

Wire regression gates into the release loop

Start with one challenge pack and one release gate. Then add escaped failures as reusable cases instead of rebuilding the entire eval stack every time an agent changes.

FAQ

Questions teams ask before gating agent releases

What is AI agent regression testing?

AI agent regression testing reruns repeatable agent tasks against a candidate agent or model, compares the result to a baseline, and flags behavior that got worse before it ships.

How do AgentClash CI gates work?

AgentClash can run a challenge pack in CI, compare candidate and baseline scorecards, then fail a pull request when correctness, cost, latency, or required artifacts cross the thresholds you set.
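Under the hood, that decision reduces to comparing scorecard deltas against thresholds. The sketch below shows that comparison in plain Python, using the numbers from the example gate above; the gate_passes helper and the field names are illustrative assumptions, not AgentClash's API.

# Illustrative gate logic only; deltas are candidate minus baseline.
def gate_passes(delta: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons); any threshold crossing blocks the pull request."""
    reasons = []
    if delta["correctness"] < thresholds["correctness"]:
        reasons.append(f"correctness {delta['correctness']} below {thresholds['correctness']}")
    if delta["latency_pct"] > thresholds["latency_pct"]:
        reasons.append(f"latency +{delta['latency_pct']}% above +{thresholds['latency_pct']}%")
    if delta["cost_pct"] > thresholds["cost_pct"]:
        reasons.append(f"cost +{delta['cost_pct']}% above +{thresholds['cost_pct']}%")
    if thresholds["artifacts_required"] and not delta["artifacts_present"]:
        reasons.append("required artifacts missing")
    return (not reasons, reasons)


# Numbers from the example gate above (candidate run-pr-184 vs baseline run-stable).
delta = {"correctness": -9, "latency_pct": 18, "cost_pct": 2, "artifacts_present": False}
thresholds = {"correctness": -3, "latency_pct": 15, "cost_pct": 10, "artifacts_required": True}

passed, reasons = gate_passes(delta, thresholds)
if not passed:
    print("gate blocked:", "; ".join(reasons))
    raise SystemExit(1)  # a non-zero exit is what fails the CI job in this sketch

In a real pipeline the same outcome would come from the compare command shown above (agentclash compare gate --baseline run-stable --candidate run-pr-184), presumably by exiting non-zero so the CI job fails.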

Can teams debug a failed agent gate?

Yes. Each run keeps replay evidence, tool calls, logs, artifacts, and scorecards so reviewers can see why the candidate failed instead of guessing from a final answer alone.