Glossary
What is agent evaluation?
Agent evaluation measures whether an AI agent completes a real task correctly under constraints such as cost and latency budgets. Unlike prompt tests, it scores the whole trajectory: tool calls, artifacts, cost, latency, and evidence quality.
Example run: candidate, baseline, and control agents on a replay timeline. CI verdict: correctness improved, latency within budget, and required artifacts were preserved for review.
agentclash run create --follow
How agent evaluation differs
Built for reviewable agent decisions
A prompt eval checks the text returned by a single model call. Agent evaluation reruns multi-step work in a sandbox and preserves the replay for review when something fails.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
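To make the scorecard idea concrete, here is a minimal Python sketch of a head-to-head comparison. The `Scorecard` class, its field names, the ranking order, and the sample numbers are all hypothetical illustrations, not the AgentClash schema or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical per-run scorecard; field names are illustrative only."""
    agent: str
    correctness: float  # fraction of checks passed, 0.0 to 1.0
    cost_usd: float     # total spend for the run
    latency_s: float    # wall-clock seconds for the whole trajectory
    tool_calls: int     # number of tool invocations, a rough proxy for tool strategy


def rank(scorecards: list[Scorecard]) -> list[Scorecard]:
    """Order runs by correctness (higher first), then cost, latency, and tool calls (lower first)."""
    return sorted(
        scorecards,
        key=lambda s: (-s.correctness, s.cost_usd, s.latency_s, s.tool_calls),
    )


# Made-up numbers: every agent is assumed to have run the same task
# under the same constraints, so the rows are directly comparable.
runs = [
    Scorecard("baseline", correctness=0.80, cost_usd=0.42, latency_s=61.0, tool_calls=14),
    Scorecard("candidate", correctness=0.90, cost_usd=0.47, latency_s=55.0, tool_calls=11),
    Scorecard("control", correctness=0.80, cost_usd=0.30, latency_s=70.0, tool_calls=9),
]

for position, card in enumerate(rank(runs), start=1):
    print(position, card.agent, card.correctness, card.cost_usd, card.latency_s)
```

Ranking by correctness first and using cost and latency only as tie-breakers is one possible policy; a team that cares more about spend could reorder the sort key.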
Workflow
Typical eval workflow
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI when the candidate regresses, before the regression reaches users.
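The workflow steps can be sketched in Python as a pack definition plus a gate check. This is a sketch under stated assumptions: `ChallengePack`, `RunSummary`, `gate`, their fields, and the sample values are hypothetical names chosen for illustration and do not reflect the actual AgentClash pack format or CLI behavior.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengePack:
    """Hypothetical pack shape: inputs, tools, scoring rules, and required artifacts."""
    name: str
    inputs: dict[str, str]
    tools: list[str]
    required_artifacts: list[str]
    min_correctness: float   # scoring rule: lowest acceptable correctness
    latency_budget_s: float  # constraint applied equally to every agent


@dataclass
class RunSummary:
    """Hypothetical summary of one agent run against a pack."""
    agent: str
    correctness: float
    latency_s: float
    artifacts: list[str] = field(default_factory=list)


def gate(pack: ChallengePack, baseline: RunSummary, candidate: RunSummary) -> list[str]:
    """Return the reasons the candidate fails the gate; an empty list means it passes."""
    failures = []
    if candidate.correctness < pack.min_correctness:
        failures.append("correctness below the pack minimum")
    if candidate.correctness < baseline.correctness:
        failures.append("correctness regressed against the baseline")
    if candidate.latency_s > pack.latency_budget_s:
        failures.append("latency over budget")
    missing = [a for a in pack.required_artifacts if a not in candidate.artifacts]
    if missing:
        failures.append("missing artifacts: " + ", ".join(missing))
    return failures


# Illustrative values only.
pack = ChallengePack(
    name="refund-ticket-triage",
    inputs={"ticket": "Customer requests a refund for a duplicate charge."},
    tools=["search_orders", "issue_refund"],
    required_artifacts=["refund_receipt.json"],
    min_correctness=0.75,
    latency_budget_s=90.0,
)
baseline = RunSummary("baseline", correctness=0.80, latency_s=61.0, artifacts=["refund_receipt.json"])
candidate = RunSummary("candidate", correctness=0.90, latency_s=55.0, artifacts=["refund_receipt.json"])

failures = gate(pack, baseline, candidate)
exit_code = 1 if failures else 0  # a CI job could pass this value to sys.exit
print("PASS" if not failures else "FAIL: " + "; ".join(failures))
```

Returning a list of reasons rather than a bare boolean keeps the verdict reviewable: the CI log shows why a candidate was blocked, not just that it was.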
Go deeper
Bring your first workload into the loop
Read the platform overview, then author a challenge pack for your first repeatable eval.
Agent evals
Real-task agent evals with replay evidence and CI gates.
LLM agent evaluation
Evaluate LLM agents on full trajectories, not one-shot answers.
Compare tools
See how AgentClash differs from prompt-eval platforms.
Agent evaluation platform
Product overview for real-task eval.
Glossary index
More AgentClash terms.
Quickstart
Validate the CLI and get to your first runnable command.
Write a challenge pack
Turn a real task into a repeatable agent evaluation.
FAQ
Agent evaluation FAQ
Is agent evaluation the same as LLM benchmarking?
Benchmarks compare models on fixed tasks. Agent evaluation also covers your prompts, tools, harness, and release gates on workloads you own.
What outputs does an agent evaluation produce?
A scorecard, a replay of the trajectory, artifacts from the run, and a pass-or-fail verdict against the validators and gates you define.
Where should teams start?
Promote one escaped failure (a failure that slipped past your existing tests) into a challenge pack, establish a baseline run, then compare the next candidate in CI or a benchmark race.