Trajectories
Agent trajectory evaluation with reviewable evidence
The final answer is not enough. AgentClash evaluates the full trajectory (tool choices, observations, retries, artifacts, and stop conditions), then preserves replay evidence for auditors and release owners.
[Illustration: Candidate, Baseline, and Control runs on a replay timeline, with a CI verdict: "Correctness improved, latency within budget, and required artifacts were preserved for review."]
agentclash run create --follow
Why trajectories matter
Built for reviewable agent decisions
Two agents can return the same answer while taking wildly different paths. Trajectory evaluation catches unsafe shortcuts, runaway loops, and brittle tool strategies.
Sandboxed real-tool execution
Head-to-head runs with fair constraints
Scorecards for correctness, cost, latency, and tool strategy
Replay trails for every important action
Challenge packs that turn failures into reusable tests
CI gates for baseline versus candidate decisions
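To make the "same answer, different paths" point concrete, here is a minimal Python sketch that flags one kind of runaway loop: the same tool call repeated many times. The Step schema, the repeated_calls helper, and the threshold are assumptions made for illustration; they are not AgentClash's data model or scoring logic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded action in a trajectory (hypothetical schema)."""
    tool: str
    args: str


def repeated_calls(steps: list[Step], max_repeats: int = 2) -> list[Step]:
    """Return calls seen more than max_repeats times, a crude loop signal."""
    counts = Counter(steps)
    return [step for step, n in counts.items() if n > max_repeats]


# Two made-up trajectories that could end with the same final answer.
direct = [Step("search", "refund policy"), Step("read", "policy.md")]
looping = [Step("search", "refund policy")] * 4 + [Step("read", "policy.md")]

print(repeated_calls(direct))   # nothing flagged
print(repeated_calls(looping))  # flags the repeated search call
```

A final-answer check would treat both runs as equal; only the step-level view separates them.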
Workflow
Trajectory eval workflow
Package the task
Describe the workload as a challenge pack with inputs, tools, scoring rules, and artifacts.
Race the agents
Run every candidate against the same task with the same constraints.
Replay the evidence
Inspect tool calls, outputs, artifacts, latency, cost, and judge evidence after the run.
Gate the release
Compare candidate and baseline runs, then fail CI before a regression reaches users.
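The first two workflow steps can be sketched in a few lines of Python. Everything here is hypothetical: ChallengePack, race, and the toy agents are in-memory stand-ins chosen for illustration, and the real challenge pack format is not shown on this page. The point is only that every agent receives the same input, tools, and step budget.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChallengePack:
    """Hypothetical stand-in for a packaged task."""
    name: str
    task_input: str
    allowed_tools: tuple[str, ...]
    max_steps: int
    expected_answer: str


# An agent is any callable returning (answer, steps_used). Assumed shape.
Agent = Callable[[str, tuple[str, ...], int], tuple[str, int]]


def race(pack: ChallengePack, agents: dict[str, Agent]) -> dict[str, dict]:
    """Run each agent on the same pack under identical constraints."""
    results = {}
    for label, agent in agents.items():
        answer, steps_used = agent(pack.task_input, pack.allowed_tools, pack.max_steps)
        results[label] = {
            "correct": answer == pack.expected_answer,
            "steps_used": steps_used,
            "within_budget": steps_used <= pack.max_steps,
        }
    return results


def careful_agent(task_input, allowed_tools, max_steps):
    return ("42", 3)  # toy agent: right answer in 3 steps


def wandering_agent(task_input, allowed_tools, max_steps):
    return ("42", 9)  # toy agent: right answer, but over the step budget


pack = ChallengePack(
    name="toy-arithmetic",
    task_input="What is 6 * 7?",
    allowed_tools=("calculator",),
    max_steps=5,
    expected_answer="42",
)
results = race(pack, {"baseline": careful_agent, "candidate": wandering_agent})
print(results)
```

Both toy agents are correct, yet only one stays within the shared budget, which is the kind of difference the later replay and gate steps are meant to surface.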
Inspect runs with docs
Bring your first workload into the loop
Use replay and scorecards to debug trajectory regressions, then encode the workload as a challenge pack for CI.
FAQ
Trajectory evaluation FAQ
What is agent trajectory evaluation?
Trajectory evaluation scores the sequence of actions and observations an agent took to complete a task, not just the final output string.
How does AgentClash store trajectory evidence?
Each run keeps replay events, tool calls, logs, artifacts, and scorecards so reviewers can reconstruct the path that produced the result.
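As a rough illustration of what "reconstruct the path" can mean, the Python sketch below orders a handful of recorded events into a readable timeline. The ReplayEvent fields and the reconstruct_path helper are assumptions for this example, not the stored AgentClash format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayEvent:
    """Hypothetical replay record; field names are illustrative only."""
    seq: int     # position of the event within the run
    kind: str    # e.g. "tool_call", "observation", "artifact"
    detail: str


def reconstruct_path(events: list[ReplayEvent]) -> list[str]:
    """Order events by sequence number and render one line per event."""
    ordered = sorted(events, key=lambda e: e.seq)
    return [f"{e.seq:02d} {e.kind}: {e.detail}" for e in ordered]


# Events may arrive out of order; the sequence number restores the path.
events = [
    ReplayEvent(3, "artifact", "summary.md"),
    ReplayEvent(1, "tool_call", "search(refund policy)"),
    ReplayEvent(2, "observation", "found policy.md"),
]

for line in reconstruct_path(events):
    print(line)
```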
Can trajectory evals gate releases?
Yes. Compare candidate and baseline trajectories via scorecards, then fail CI when correctness, cost, latency, or evidence quality regresses.
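A release gate of this kind can be sketched in Python as a comparison of two scorecards. The Scorecard fields, the 5% tolerance, and the sample numbers are all assumptions for illustration; evidence quality is left out because this page does not say how it is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical scorecard; real metrics and units may differ."""
    correctness: float  # higher is better
    cost_usd: float     # lower is better
    latency_s: float    # lower is better


def regressions(baseline: Scorecard, candidate: Scorecard,
                tolerance: float = 0.05) -> list[str]:
    """List the metrics where the candidate is worse than the baseline."""
    problems = []
    if candidate.correctness < baseline.correctness:
        problems.append("correctness")
    if candidate.cost_usd > baseline.cost_usd * (1 + tolerance):
        problems.append("cost")
    if candidate.latency_s > baseline.latency_s * (1 + tolerance):
        problems.append("latency")
    return problems


def gate_exit_code(baseline: Scorecard, candidate: Scorecard) -> int:
    """Return 0 when the candidate passes, 1 when any metric regressed."""
    return 1 if regressions(baseline, candidate) else 0


# Made-up numbers: the candidate is more correct but noticeably slower.
baseline = Scorecard(correctness=0.90, cost_usd=0.40, latency_s=12.0)
candidate = Scorecard(correctness=0.92, cost_usd=0.41, latency_s=15.0)

print(regressions(baseline, candidate))
print(gate_exit_code(baseline, candidate))
```

A CI job could pass the return value of gate_exit_code to sys.exit so that a nonzero status fails the pipeline.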