CI/CD
CI/CD agent evaluation for pull request gates
Model, prompt, and tool changes should not ship on vibes. AgentClash runs repeatable agent workloads in CI, compares candidate scorecards to baselines, and blocks merges when behavior regresses.
[Figure: panels labeled Candidate, Baseline, and Control, with a replay timeline and a CI verdict]
Correctness improved, latency within budget, and required artifacts were preserved for review.
agentclash run create --follow
What CI/CD agent eval needs
Built for reviewable agent decisions
Release engineering needs deterministic workloads, stable scoring, and enough evidence to debug a failed gate without reproducing the issue manually.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
Workflow
CI gate workflow
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
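The gate step above can be sketched as a comparison between two scorecards. This is a minimal illustration in Python, not AgentClash's actual API: the `Scorecard` fields, the `Thresholds` defaults, and the `gate` function are hypothetical names chosen for this example, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical scorecard shape; real field names may differ."""
    correctness: float  # fraction of checks passed, 0.0 to 1.0
    cost_usd: float     # total spend for the run
    latency_s: float    # wall-clock duration of the run


@dataclass(frozen=True)
class Thresholds:
    """Illustrative tolerances for how far a candidate may regress."""
    max_correctness_drop: float = 0.02  # absolute drop allowed
    max_cost_increase: float = 0.10     # relative increase allowed
    max_latency_increase: float = 0.15  # relative increase allowed


def gate(baseline: Scorecard, candidate: Scorecard, t: Thresholds) -> list[str]:
    """Return the list of regressions; an empty list means the gate passes."""
    failures = []
    if baseline.correctness - candidate.correctness > t.max_correctness_drop:
        failures.append("correctness regressed")
    if candidate.cost_usd > baseline.cost_usd * (1 + t.max_cost_increase):
        failures.append("cost regressed")
    if candidate.latency_s > baseline.latency_s * (1 + t.max_latency_increase):
        failures.append("latency regressed")
    return failures


# Made-up example values: correctness improves, but latency grows by 30%.
baseline = Scorecard(correctness=0.90, cost_usd=1.00, latency_s=40.0)
candidate = Scorecard(correctness=0.93, cost_usd=1.05, latency_s=52.0)
failures = gate(baseline, candidate, Thresholds())
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit()
```

In a pipeline, a nonzero exit code from a step like this is what typically blocks the merge. Returning the list of failure reasons, and not just a boolean, keeps the verdict reviewable.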
Wire gates with docs
Bring your first workload into the loop
Start with the CI/CD agent gates guide, then promote real failures into challenge packs your pipeline can rerun on every change.
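As a rough picture of what a challenge pack carries (inputs, tools, scoring rules, and artifacts, as described in the workflow), here is a hypothetical in-memory sketch in Python. It is not AgentClash's pack format: the `ChallengePack` class, its field names, the example tool names, and the rule that scoring weights sum to 1.0 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengePack:
    """Hypothetical in-memory shape; not AgentClash's actual pack format."""
    name: str
    inputs: dict[str, str]           # task prompt, fixtures, and so on
    tools: list[str]                 # tools the agent is allowed to call
    scoring_rules: dict[str, float]  # rule name -> weight (assumed to sum to 1.0)
    required_artifacts: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the pack is usable."""
        problems = []
        if not self.inputs:
            problems.append("pack has no inputs")
        if not self.tools:
            problems.append("pack allows no tools")
        total = sum(self.scoring_rules.values())
        if abs(total - 1.0) > 1e-9:
            problems.append(f"scoring weights sum to {total}, expected 1.0")
        return problems


# A made-up pack capturing one past failure so it can be rerun on every change.
pack = ChallengePack(
    name="refund-ticket-regression",
    inputs={"task": "Resolve the refund ticket using the billing tool."},
    tools=["billing_lookup", "send_email"],
    scoring_rules={"correctness": 0.7, "tool_strategy": 0.3},
    required_artifacts=["final_reply.txt"],
)
problems = pack.validate()
```

Validating a pack before it runs keeps a malformed workload from producing a misleading scorecard.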
FAQ
CI/CD agent evaluation FAQ
How do agent eval gates fit into CI/CD?
A challenge pack runs on every candidate change, AgentClash compares the resulting scorecard to a baseline, and the pipeline fails when the candidate regresses past the configured thresholds.
What happens when a gate fails?
Reviewers open replay evidence, inspect tool calls and artifacts, and either fix the regression or update the baseline intentionally.
Can gates cover cost and latency budgets?
Yes. Scorecards include correctness and evidence quality, plus cost and latency signals that you can enforce in release policy.
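One way to picture a budget policy is as absolute limits checked against a single run, independent of any baseline. The sketch below is illustrative only: `check_budgets`, the signal names, and the limits are hypothetical and do not describe AgentClash's configuration.

```python
def check_budgets(signals: dict[str, float], budgets: dict[str, float]) -> dict[str, bool]:
    """Map each budgeted signal to True when the run stayed within budget.

    A signal missing from the run counts as a violation, so a gate cannot
    pass silently when a measurement was never recorded.
    """
    return {
        name: name in signals and signals[name] <= limit
        for name, limit in budgets.items()
    }


# Made-up limits and measurements for illustration.
budgets = {"cost_usd": 2.00, "latency_s": 60.0}
run_signals = {"cost_usd": 1.40, "latency_s": 75.0}
result = check_budgets(run_signals, budgets)
within_policy = all(result.values())
```

Here the run stays under the cost limit but exceeds the latency limit, so `within_policy` is false and a release policy built this way would reject the run.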