AgentClash sitemap
Browse public AgentClash pages, AI agent evaluation resources, docs, and engineering posts.
Core pages
Home: AgentClash overview and product entry point.
Agent evaluation: Evaluate AI agents on real tasks with replay and scorecards.
Agent regression testing: Catch AI agent regressions with baseline comparisons and CI gates.
Docs: Guides and references for running AgentClash.
Blog: Engineering notes on AI agent evaluation and release gates.
Why AgentClash: Why real-task AI agent evaluation matters.
Team: The engineers building AgentClash.
LLMs index: Machine-readable index for AI assistants and coding agents.
Full LLMs bundle: Complete machine-readable AgentClash docs bundle.
Blog posts
AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks: A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
Why We Built AgentClash: Static benchmarks leak. Leaderboards reward hype. We built something different.
Getting Started
Concepts
Runs and Evals: Understand the difference between a run, a ranked result set, and the broader eval concept.
Agents and Deployments: See how runnable agent targets are modeled before they can participate in an eval.
Challenge Packs and Inputs: Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
Replay and Scorecards: Learn how canonical events become timelines, evidence, and comparison-ready outputs.
Tools, Network, and Secrets: See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
Artifacts: Understand stored files, pack assets, run evidence, and signed downloads.
Challenge packs
Reference overview: Map of every challenge-pack documentation page and where each topic is enforced in Go.
Bundle YAML reference: Top-level bundle keys, manifests, and constraints for prompt_eval versus native.
Evaluation spec: Validators, targets, metric collectors, scorecard dimensions, strategies, and post-execution captures.
LLM judges: Rubric, assertion, n_wise, and reference modes, plus consensus keys and budgets.
Tools, primitives & policy: From allowed_tool_kinds, built-in primitives, and composed tools to http_request mocks and cycles.
Sandbox & E2B: Pack sandbox block, outbound network CIDR lists, sandbox provider env, and no-op modes.
Input sets & cases: Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
Eval workflows & gates: CLI eval start, baseline, scorecard compare, gates, and regression scope flags, grounded in Cobra.
Guides
Write a Challenge Pack: Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
Configure Runtime Resources: Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
Interpret Results: Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
CI/CD Agent Gates: Define the agent revision, workload, baseline, and release gate a pull request should run.
CI/CD Workload Recipes: Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
Use with AI Tools: Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.
Agent Skills
Skill Catalog: Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
Challenge Pack Skills: Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
Agent Build Skills: Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
CLI Setup Skill: Configure the CLI, authenticate, select workspaces, and run doctor checks.
Eval Runner Skill: Start, follow, and report AgentClash evals and runs with useful evidence.
Scorecard Reader Skill: Turn rankings, scorecards, and replay evidence into engineering findings.
Regression Flywheel Skill: Promote useful run failures into regression suites and verify suite-only runs.
CI Release Gate Skill: Compare candidates against baselines and wire AgentClash gates into CI.
Reference
Architecture
Overview: Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
Orchestration: How API requests become Temporal workflows and how the worker executes them.
Sandbox Layer: Why execution is isolated behind a provider boundary and how E2B fits today.
Data Model: The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
Evidence Loop: How run events, artifacts, and scorecards move from execution into replay and review.
Frontend: How the Next.js app is split between public product pages, authenticated app routes, and docs.