# AgentClash

> AgentClash runs agents against repeatable challenge packs, captures replay evidence, and shows where a run won, failed, or drifted.

Use this index when you want the shortest machine-readable map of the public docs and selected product pages. Fetch `/llms-full.txt` for the bundled corpus, or use the `/docs-md/...` links below for page-level markdown exports.

## Core entrypoints

- [Docs home](https://www.agentclash.dev/docs-md) - overview, navigation, and starting points.
- [Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - fastest path to a real run.
- [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host) - local stack and service dependencies.
- [First Eval](https://www.agentclash.dev/docs-md/getting-started/first-eval) - end-to-end walkthrough of one eval path.
- [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli) - generated command reference.
- [Config Reference](https://www.agentclash.dev/docs-md/reference/config) - generated environment and precedence reference.
- [Agent Skills](https://www.agentclash.dev/docs-md/agent-skills) - copyable AgentClash skills for coding agents.
- [Full bundle](https://www.agentclash.dev/llms-full.txt) - all shipped docs in one file.

## Public product pages

- [AI Agent Evaluation Platform](https://www.agentclash.dev/platform/agent-evaluation) - Public page for real-task AI agent evaluation, replay evidence, scorecards, challenge packs, and CI regression gates.
- [AI Agent Regression Testing](https://www.agentclash.dev/platform/agent-regression-testing) - Public page for baseline-versus-candidate agent regression testing, pull request gates, and release evidence.

## Blog posts

- [AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks](https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing) - A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
- [Why We Built AgentClash](https://www.agentclash.dev/blog/why-we-built-agentclash) - Static benchmarks leak. Leaderboards reward hype. We built something different.

## Getting Started

- [Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - Use the hosted backend and validate auth, workspace access, and run creation.
- [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host) - Bring up the local stack with Postgres, Temporal, API server, worker, and web app.
- [First Eval](https://www.agentclash.dev/docs-md/getting-started/first-eval) - Walk through the current happy path from seeded data to live run events and ranking output.

## Concepts

- [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals) - Understand the difference between a run, a ranked result set, and the broader eval concept.
- [Agents and Deployments](https://www.agentclash.dev/docs-md/concepts/agents-and-deployments) - See how runnable agent targets are modeled before they can participate in an eval.
- [Challenge Packs and Inputs](https://www.agentclash.dev/docs-md/concepts/challenge-packs-and-inputs) - Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
- [Replay and Scorecards](https://www.agentclash.dev/docs-md/concepts/replay-and-scorecards) - Learn how canonical events become timelines, evidence, and comparison-ready outputs.
- [Tools, Network, and Secrets](https://www.agentclash.dev/docs-md/concepts/tools-network-and-secrets) - See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
- [Artifacts](https://www.agentclash.dev/docs-md/concepts/artifacts) - Understand stored files, pack assets, run evidence, and signed downloads.

## Challenge packs

- [Reference overview](https://www.agentclash.dev/docs-md/challenge-packs) - Map of every challenge-pack documentation page and where each topic is enforced in Go.
- [Bundle YAML reference](https://www.agentclash.dev/docs-md/challenge-packs/bundle-yaml-reference) - Top-level bundle keys, manifests, and constraints for prompt_eval versus native.
- [Evaluation spec](https://www.agentclash.dev/docs-md/challenge-packs/evaluation-spec-reference) - Validators, targets, metric collectors, scorecard dimensions, strategies, and post-execution captures.
- [LLM judges](https://www.agentclash.dev/docs-md/challenge-packs/llm-judges) - Rubric, assertion, n_wise, and reference modes, plus consensus keys and budgets.
- [Tools, primitives & policy](https://www.agentclash.dev/docs-md/challenge-packs/tools-primitives-and-policy) - allowed_tool_kinds, built-in primitives, and composed tools, down to http_request mocks and cycles.
- [Sandbox & E2B](https://www.agentclash.dev/docs-md/challenge-packs/sandbox-and-e2b) - Pack sandbox block, outbound network CIDR lists, sandbox provider env, and no-op modes.
- [Input sets & cases](https://www.agentclash.dev/docs-md/challenge-packs/input-sets-and-cases) - Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
- [Eval workflows & gates](https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates) - CLI eval start, baseline scorecard compare, gates, and regression scope flags, grounded in Cobra.

## Guides

- [Write a Challenge Pack](https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack) - Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
- [Configure Runtime Resources](https://www.agentclash.dev/docs-md/guides/configure-runtime-resources) - Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
- [Interpret Results](https://www.agentclash.dev/docs-md/guides/interpret-results) - Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
- [CI/CD Agent Gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates) - Define the agent revision, workload, baseline, and release gate a pull request should run.
- [CI/CD Workload Recipes](https://www.agentclash.dev/docs-md/guides/ci-cd-workload-recipes) - Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
- [Use with AI Tools](https://www.agentclash.dev/docs-md/guides/use-with-ai-tools) - Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.

## Agent Skills

- [Skill Catalog](https://www.agentclash.dev/docs-md/agent-skills) - Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
- [Challenge Pack Skills](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills) - Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
- [Agent Build Skills](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills) - Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
- [CLI Setup Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup) - Configure the CLI, authenticate, select workspaces, and run doctor checks.
- [Eval Runner Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner) - Start, follow, and report AgentClash evals and runs with useful evidence.
- [Scorecard Reader Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader) - Turn rankings, scorecards, and replay evidence into engineering findings.
- [Regression Flywheel Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel) - Promote useful run failures into regression suites and verify suite-only runs.
- [CI Release Gate Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate) - Compare candidates against baselines and wire AgentClash gates into CI.

## Reference

- [CLI](https://www.agentclash.dev/docs-md/reference/cli) - Commands, flags, and command groups generated from the Cobra source tree.
- [Config](https://www.agentclash.dev/docs-md/reference/config) - Current environment surface pulled from the API, worker, CLI, and example config sources.

## Architecture

- [Overview](https://www.agentclash.dev/docs-md/architecture/overview) - Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
- [Orchestration](https://www.agentclash.dev/docs-md/architecture/orchestration) - How API requests become Temporal workflows and how the worker executes them.
- [Sandbox Layer](https://www.agentclash.dev/docs-md/architecture/sandbox-layer) - Why execution is isolated behind a provider boundary and how E2B fits today.
- [Data Model](https://www.agentclash.dev/docs-md/architecture/data-model) - The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
- [Evidence Loop](https://www.agentclash.dev/docs-md/architecture/evidence-loop) - How run events, artifacts, and scorecards move from execution into replay and review.
- [Frontend](https://www.agentclash.dev/docs-md/architecture/frontend) - How the Next.js app is split between public product pages, authenticated app routes, and docs.

## Contributing

- [Setup](https://www.agentclash.dev/docs-md/contributing/setup) - Clone the repo, boot the local stack, and choose the fastest dev loop for your task.
- [Codebase Tour](https://www.agentclash.dev/docs-md/contributing/codebase-tour) - Map the top-level modules before you start changing APIs, workflows, or the web app.
- [Testing](https://www.agentclash.dev/docs-md/contributing/testing) - Pick the smallest useful validation loop and use review checkpoints for scoped changes.
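Every docs page in this index is also available as a page-level markdown export under `/docs-md/...`, alongside the `/llms-full.txt` bundle. A minimal sketch of fetching them programmatically, using only the URL shapes shown in this index (the `docs_md_url` helper is illustrative, not part of AgentClash):

```python
from urllib.request import urlopen  # needed only if you uncomment the fetch

BASE = "https://www.agentclash.dev"

def docs_md_url(path: str) -> str:
    """Build a page-level markdown export URL from a docs path in this index."""
    return f"{BASE}/docs-md/{path.strip('/')}"

quickstart = docs_md_url("getting-started/quickstart")
bundle = f"{BASE}/llms-full.txt"  # the full bundled corpus in one file

# Fetch when online, e.g.:
# text = urlopen(quickstart).read().decode("utf-8")
```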
## Agent Skill Pages

- [agentclash-agent-build-author](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author) - Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.
- [agentclash-agent-deployment-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup) - Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.
- [agentclash-runtime-resources-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup) - Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.
- [agentclash-ci-release-gate](https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate) - Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.
- [agentclash-cli-setup](https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup) - Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.
- [agentclash-eval-runner](https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner) - Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
- [agentclash-regression-flywheel](https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel) - Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.
- [agentclash-scorecard-reader](https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader) - Use when turning AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.
- [agentclash-challenge-pack-artifacts](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts) - Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.
- [agentclash-challenge-pack-input-sets](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets) - Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
- [agentclash-challenge-pack-llm-judges](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges) - Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
- [agentclash-challenge-pack-planner](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner) - Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
- [agentclash-challenge-pack-scoring-validators](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators) - Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
- [agentclash-challenge-pack-tools-sandbox](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox) - Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.
- [agentclash-challenge-pack-validation-publish](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish) - Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.
- [agentclash-challenge-pack-yaml-author](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author) - Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
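The agentclash-ci-release-gate page above describes running `agentclash ci run` in GitHub Actions and acting on gate exit codes. A hedged sketch of that pattern, assuming only that a nonzero exit code means the gate failed; any extra flags (baseline selection, manifest path) are not documented in this index, so the bare command line here is an assumption — consult the CLI reference for the real surface:

```python
import subprocess
import sys

def run_gate(cmd: list[str]) -> int:
    """Run a CI gate command and return its exit code; nonzero fails the job."""
    return subprocess.run(cmd).returncode

def main() -> None:
    # `agentclash ci run` is the command named by the ci-release-gate skill;
    # propagating its exit code makes the CI job pass or fail with the gate.
    sys.exit(run_gate(["agentclash", "ci", "run"]))
```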