Eval program

First governed benchmark in 2 weeks

AgentClash is the platform. Our team gets you from live agents to frozen challenge packs, baseline evidence, and CI gates in your workspace. Fixed offerings, not open-ended consulting.

These packages are paid engagements, scoped after discovery. The Team pilot is free self-serve product access if you want to start on your own.

Email hello@agentclash.dev

Offerings

Four fixed packages

Each package is a paid, fixed-scope engagement that ends with customer-owned artifacts in your workspace. We quote scope after the discovery call; there is no public rate card.

  • Eval Discovery

    1 week

    Audit your agents, document 5 concrete failure modes, and deliver a prioritized challenge pack roadmap.

    Best when: You have agents in production but no frozen benchmark yet.

    Email about Eval Discovery
  • Challenge Pack Build

    2 to 4 weeks

    Ship 3–10 custom challenge packs from your real workflows, scored and versioned in your workspace.

    Best when: You know what to test and need packs built from live tasks and tools.

    Email about Challenge Pack Build
  • Benchmark & Gate Setup

    2 weeks

    Run a baseline, wire a CI release gate, and hand off an executive scorecard template your committee can defend.

    Best when: You have packs and need a governed release decision in CI.

    Email about Benchmark & Gate Setup
  • Managed Eval Retainer

    Monthly

    Run release benchmarks on every ship candidate and deliver a monthly reliability report with regression trends.

    Best when: You ship agents often and want ongoing benchmark coverage without building an eval team.

    Email about Managed Eval Retainer

Guardrails

Platform adoption, not a consulting shop

  • Every engagement produces artifacts in your AgentClash workspace: packs, baselines, gates, or CI handoff.
  • We do not run black-box evals outside your tenancy. You own the packs and the evidence.
  • Services are paid engagements. The free 45-day Team pilot is still the default path if you want to self-serve first.
  • No vague SOWs. Each package above has a fixed duration and named deliverable.

Discovery intake

What we capture on the first call

On a free 30-minute discovery call, we map your agents to the right package and quote scope. Bring what you have; we structure the benchmark plan.

  • Agent workflow and tools in scope
  • Recent failure examples or incident tickets
  • Compliance, residency, or policy constraints
  • Target release decision and stakeholders
  • Current eval or observability tooling
  • Success criteria for the first governed benchmark

FAQ

Eval services questions

How is this different from the Team pilot?
The Team pilot is product access in your workspace. Services are fixed-scope engagements where our team builds packs, baselines, or gates with you. Many teams start the pilot and add a 2-week Benchmark & Gate Setup sprint.
Do we keep the challenge packs after the engagement?
Yes. All packs, baselines, scorecards, and gate configs live in your workspace. You can extend them without us.
Can we self-host instead of hosted AgentClash?
Yes. AgentClash is MIT-licensed. We can deliver packs and gate templates against your self-hosted stack. Raise your deployment requirements during discovery.
What do you need before discovery?
A short description of the agent workflow, one or two real failure examples, and who signs the release decision. We handle the rest on the first call.
Are services free?
No. The four packages above are paid, fixed-scope engagements. The discovery call is free. Product access starts free on the Team pilot; paid tiers are listed on the pricing page.

Book a discovery call

Tell us about your agents and release process. We will recommend a package and timeline, or point you to the Team pilot if that is the better first step.

Email hello@agentclash.dev