Public sitemap
AgentClash sitemap
Browse public AgentClash pages, AI agent evaluation resources, docs, and engineering posts.
Core pages
Home - AgentClash overview and product entry point.
Agent evaluation - Evaluate AI agents on real tasks with replay and scorecards.
Agent regression testing - Catch AI agent regressions with baseline comparisons and CI gates.
Use cases - Coding, research, and support agent evaluation use cases.
Features - Scorecards, replay, and challenge packs for agent evaluation.
Industries - Banking, insurance, and government agent evaluation playbooks.
Glossary - Definitions for agent evaluation, challenge packs, and release gates.
Docs - Guides and references for running AgentClash.
Blog - Engineering notes on AI agent evaluation and release gates.
Changelog - Product updates shipped every ten days since launch.
Why AgentClash - Why real-task AI agent evaluation matters.
Team - The engineers building AgentClash.
LLMs index - Machine-readable index for AI assistants and coding agents.
Full LLMs bundle - Complete machine-readable AgentClash docs bundle.
SEO landing pages
Open source AI agent evaluation - MIT-licensed, self-hostable AI agent evaluation with replay, scorecards, and CI gates.
Agent evals - Real-task agent evals with replay evidence, scorecards, and CI regression gates.
LLM agent evaluation - Evaluate tool-using LLM agents on real tasks with replay and scorecards.
Agent evaluation framework - Framework for real-task agent evaluation with replay, scorecards, and CI gates.
AI agent testing - Test AI agents on real tasks with replay evidence and CI regression gates.
Agent trajectory evaluation - Score full agent trajectories with replay evidence and release gates.
CI/CD agent evaluation - Run agent evaluation in CI/CD with scorecard gates and replay evidence.
AI agent benchmark - Benchmark AI agents on real tasks with replay and scorecards.
Agent reliability benchmark - Benchmark agent reliability on real tasks with regression gates.
Coding agent evaluation - Evaluate coding agents on real repositories with replay and CI gates.
Research agent evaluation - Evaluate research agents on investigation tasks with replay and scorecards.
Support agent evaluation - Evaluate support agents on resolution workflows with replay and gates.
Banking agent evaluation - Evaluate financial services agents with replay evidence and release gates.
Insurance agent evaluation - Evaluate insurance support agents with policy checks and replay.
Government agent evaluation - Evaluate public-sector agents with replay and evidence bundles.
Agent evaluation (glossary) - Definition of agent evaluation vs prompt testing.
Challenge pack (glossary) - Definition of AgentClash challenge packs.
Release gate (glossary) - Definition of agent release gates and CI regression checks.
Agent scorecards - Scorecards for correctness, cost, latency, and evidence quality.
Agent replay - Replay tool calls, artifacts, and evidence for agent debugging.
Challenge packs - Repeatable agent evaluation workloads with scoring and CI gates.
Blog posts
Evaluating Bilingual Customer Support Agents: Arabic, English, and Release Evidence - Bilingual support agents need evals on real ticket flows in each language, not a single translated prompt. How to build Arabic and English cases with replay and gates.
AI Agent Governance for Middle East Enterprises: Residency, Evidence, and Release Gates - UAE and GCC buyers need clear separation between data residency, sovereign cloud, and agent release evidence. A governance checklist grounded in repeatable eval workflows.
Building an Agent Eval Program in a Regulated Enterprise - Regulated enterprises need versioned workloads, audit-friendly replay, baseline gates, and CI handoff. A phased plan to stand up a governed agent eval program.
When the US Government Banned Fable 5 - Three days after launch, the US Commerce Department ordered Anthropic to pull its best model. A model's availability is now a jurisdictional variable, and that changes how everyone should build.
Why Your AI Pilot Failed (and How Eval Fixes the Second Attempt) - Most AI agent pilots fail on workload mismatch, missing baselines, and unreviewable evidence. How to redesign the second attempt with governed benchmarks and replay.
Agent Evaluation vs Prompt Evaluation: When Braintrust Isn't Enough - Prompt evaluation tools like Braintrust excel at scoring model outputs. Production agents need trajectory evals on real tasks: here's how to know when to add sandboxed agent testing.
The AI Platform Lead's Guide to Agent Release Gates - Release gates for AI agents need frozen workloads, baseline comparison, replay evidence, and CI handoff. A guide for platform leads shipping governed agent systems.
Coding agent benchmark — June 2026 - Our first measured public benchmark: four GPT generations on a real coding task with frozen challenge packs, full trajectory scoring, and replay evidence. Methodology, scoreboard, and reproduction steps.
AgentClash product updates — June 2026 - Monthly rollup of AgentClash product updates from May and June 2026: datasets, security packs, skills distribution, and the first public coding-agent benchmark. Full timeline on the changelog.
How to Get AI Agent Approval from Security and Compliance - Security and compliance teams need replay evidence, frozen workloads, and gate verdicts before they approve an AI agent. A practical approval checklist for platform leads.
How to Benchmark AI Agents on Your Own Data (Not Public Leaderboards) - Public leaderboards answer a general model question. Shipping agents requires benchmarks on your workflows, tools, and failure modes: a practical playbook.
pass@k, pass^k, and Reliability: What Enterprise Teams Should Measure - Enterprise agent releases need explicit reliability metrics. A practical guide to pass@k and pass^k for platform teams, extending the pass@k vs pass^k primer with release-gate examples.
Evaluating Coding Agents on Private Repos: A Practical Checklist - Coding agents on private repositories need evals that respect your layout, tests, and policies (not public leaderboard tasks). A checklist for platform and developer-experience teams.
AgentClash vs LangSmith vs Braintrust for Production Agent Testing - LangSmith traces chains, Braintrust scores prompts, AgentClash races tool-using agents on real tasks with replay and CI gates: an honest workflow comparison for production agent testing.
I Tried to Fingerprint How AI Agents Cheat — and the Brand Didn't Matter - A small pilot on AgentClash: 8 frontier models, 6 kinds of test-gaming, 192 runs. The surprise wasn't which lab built the model; it was that how an agent cheats tracks its capability, not its provider. And one deleted comment made the cheating disappear.
pass@k vs pass^k: What Agent Reliability Metrics Actually Measure - A practical guide to pass@k and pass^k for agent evaluation: when independent retries and strict success-over-trials measure different kinds of reliability.
Why AgentClash Races Agents Head-to-Head - Staggered, one-at-a-time evals compare runs that never faced the same conditions. AgentClash runs every model on the same task at the same time on the same budget. Here is why concurrent racing produces a fairer verdict.
How AgentClash Scores Agent Trajectories - Final-answer grading misses how an agent got there. Here is how AgentClash scores the whole trajectory with deterministic checks, numeric metrics, behavioral signals, and LLM judges, then aggregates them into one defensible verdict.
AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks - A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
Why We Built AgentClash - Static benchmarks leak. Leaderboards reward hype. We built something different.
Getting Started
Concepts
Runs and Evals - Understand the difference between a run, a ranked result set, and the broader eval concept.
Agents and Deployments - See how runnable agent targets are modeled before they can participate in an eval.
Challenge Packs and Inputs - Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
Replay and Scorecards - Learn how canonical events become timelines, evidence, and comparison-ready outputs.
Tools, Network, and Secrets - See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
Artifacts - Understand stored files, pack assets, run evidence, and signed downloads.
Try CLI - Interactive disposable terminal demos for README badges; try CLIs before install.
Voice Artifact Contracts - Use generic audio, timing, sync, and media reports to evaluate voice agents across providers.
Challenge packs
Reference overview - Map of every challenge-pack documentation page and where each topic is enforced in Go.
Bundle YAML reference - Top-level bundle keys, manifests, constraints for prompt_eval versus native.
Evaluation spec - Validators, targets, metric collectors, scorecard dimensions, strategies, post-execution captures.
LLM judges - Rubric, assertion, n_wise, and reference modes plus consensus keys and budgets.
Tools, primitives & policy - allowed_tool_kinds, built-in primitives, composed tools to http_request mocks and cycles.
Sandbox & E2B - Pack sandbox block, outbound network CIDR lists, sandbox provider env, no-op modes.
Input sets & cases - Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
Multi-turn packs - Scripted, LLM, and human user-simulator phases with operator APIs and calibration reviews.
Eval workflows & gates - CLI eval start baseline scorecard compare gates and regression scope flags grounded in Cobra.
Guides
Write a Challenge Pack - Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
Configure Runtime Resources - Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
Interpret Results - Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
CI/CD Agent Gates - Define the agent revision, workload, baseline, and release gate a pull request should run.
Datasets overview - Pinned dataset evals, baselines, regression suite sync, and the dataset command lifecycle.
Dataset CI Gates - Record dataset eval baselines, sync examples into regression suites, and gate CI with agentclash dataset test.
Security evaluation - Stress-run security packs, score secret leaks and adversarial acceptance, and measure posture.
CI/CD Workload Recipes - Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
Use with AI Tools - Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.
Agent Skills
Skill Catalog - Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
Hub Skill - Start here for workflow map, dependency order, UI links, and pointers to every AgentClash skill.
Quickstart Skill - Run readiness checks and get the suggested next CLI command before evals.
Challenge Pack Skills - Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
Agent Build Skills - Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
CLI Setup Skill - Configure the CLI, authenticate, select workspaces, and run doctor checks.
Eval Runner Skill - Start, follow, and report AgentClash evals and runs with useful evidence.
Scorecard Reader Skill - Turn rankings, scorecards, and replay evidence into engineering findings.
Compare And Triage Skill - Manage baselines, compare runs, evaluate gates, and build replay triage envelopes.
Regression Flywheel Skill - Promote useful run failures into regression suites and verify suite-only runs.
CI Release Gate Skill - Compare candidates against baselines and wire AgentClash gates into CI.
Agent Harness Setup Skill - Create and run Agent Harness coding tasks, suites, executions, and failure review.
Multi Turn Operator Skill - Submit human operator messages when multi_turn run agents await input.
Dataset Workflows Skill - Manage datasets, eval gates, synthetic generation, traces, and regression sync.
Prompt Eval Playground Skill - Scaffold, validate, and run prompt eval configs and playground experiments.
Workspace Admin Skill - Administer organizations, workspaces, and membership beyond basic CLI login.
Security Evaluation Skill - Run client-side security stress harnesses against security challenge packs.
Reference
Architecture
Overview - Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
Orchestration - How API requests become Temporal workflows and how the worker executes them.
Sandbox Layer - Why execution is isolated behind a provider boundary and how E2B fits today.
Data Model - The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
Evidence Loop - How run events, artifacts, and scorecards move from execution into replay and review.
Frontend - How the Next.js app is split between public product pages, authenticated app routes, and docs.