Eval program
First governed benchmark in 2 weeks
AgentClash is the platform. Our team gets you from live agents to frozen challenge packs, baseline evidence, and CI gates in your workspace. Fixed offerings, not open-ended consulting.
These packages are paid engagements, scoped after discovery. The Team pilot is free self-serve product access if you want to start on your own.
Offerings
Four fixed packages
Each package is a paid, fixed-scope engagement that ends with customer-owned artifacts in your workspace. We quote scope after discovery; there is no public rate card.
Eval Discovery
1 week
Audit your agents, document 5 concrete failure modes, and deliver a prioritized challenge pack roadmap.
Best when: You have agents in production but no frozen benchmark yet.
Email about Eval Discovery
Challenge Pack Build
2 to 4 weeks
Ship 3–10 custom challenge packs from your real workflows, scored and versioned in your workspace.
Best when: You know what to test and need packs built from live tasks and tools.
Email about Challenge Pack Build
Benchmark & Gate Setup
2 weeks
Run a baseline, wire a CI release gate, and hand off an executive scorecard template your committee can defend.
Best when: You have packs and need a governed release decision in CI.
Email about Benchmark & Gate Setup
Managed Eval Retainer
Monthly
Release benchmarks on every ship candidate, plus a monthly reliability report with regression trends.
Best when: You ship agents often and want ongoing benchmark coverage without building an eval team.
Email about Managed Eval Retainer
Guardrails
Platform adoption, not a consulting shop
- Every engagement produces artifacts in your AgentClash workspace: packs, baselines, gates, or CI handoff.
- We do not run black-box evals outside your tenancy. You own the packs and the evidence.
- Services are paid engagements. The free 45-day Team pilot is still the default path if you want to self-serve first.
- No vague SOWs. Each package above has a fixed duration and named deliverable.
Discovery intake
What we capture on the first call
A free 30-minute discovery call maps your agents to the right package and quotes scope. Bring what you have; we structure the benchmark plan.
- Agent workflow and tools in scope
- Recent failure examples or incident tickets
- Compliance, residency, or policy constraints
- Target release decision and stakeholders
- Current eval or observability tooling
- Success criteria for the first governed benchmark
Explore
Product paths
- Enterprise pilot: 45-day Team pilot with no credit card. Self-serve product access first.
- Product pricing: Free, Pro, Team, and Enterprise tiers. BYOK on every plan.
- Evaluation platform: Same-tools races, sandbox execution, replay, and scorecards.
- Documentation: Challenge packs, CI gates, and self-host guides.
FAQ
Eval services questions
- How is this different from the Team pilot?
- The Team pilot is product access in your workspace. Services are fixed-scope engagements where our team builds packs, baselines, or gates with you. Many teams start the pilot and add a 2-week Benchmark & Gate Setup sprint.
- Do we keep the challenge packs after the engagement?
- Yes. All packs, baselines, scorecards, and gate configs live in your workspace. You can extend them without us.
- Can we self-host instead of hosted AgentClash?
- Yes. AgentClash is MIT-licensed. We can deliver packs and gate templates against your self-hosted stack. Discuss deployment during discovery.
- What do you need before discovery?
- A short description of the agent workflow, one or two real failure examples, and who signs the release decision. We handle the rest on the first call.
- Are services free?
- No. The four packages above are paid, fixed-scope engagements. The discovery call is free. Product access starts free with the Team pilot; paid tiers are listed on the pricing page.
Book a discovery call
Tell us about your agents and release process. We will recommend a package and timeline, or point you to the Team pilot if that is the better first step.