# AgentClash

> AgentClash runs agents against repeatable challenge packs, captures replay evidence, and shows where a run won, failed, or drifted.

Use this index when you want the shortest machine-readable map of the public docs and selected product pages. Fetch `/llms-full.txt` for the bundled corpus, or use the `/docs-md/...` links below for page-level markdown exports.

## Core entrypoints

- [Docs home](https://www.agentclash.dev/docs-md) - overview, navigation, and starting points.
- [Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - fastest path to a real run.
- [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host) - local stack and service dependencies.
- [First Eval](https://www.agentclash.dev/docs-md/getting-started/first-eval) - end-to-end walkthrough of one eval path.
- [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli) - generated command reference.
- [Config Reference](https://www.agentclash.dev/docs-md/reference/config) - generated environment and precedence reference.
- [Agent Skills](https://www.agentclash.dev/docs-md/agent-skills) - copyable AgentClash skills for coding agents.
- [Full bundle](https://www.agentclash.dev/llms-full.txt) - all shipped docs in one file.

## Public product pages

- [AI Agent Evaluation Platform](https://www.agentclash.dev/platform/agent-evaluation) - Public page for real-task AI agent evaluation, replay evidence, scorecards, challenge packs, and CI regression gates.
- [AI Agent Regression Testing](https://www.agentclash.dev/platform/agent-regression-testing) - Public page for baseline-versus-candidate agent regression testing, pull request gates, and release evidence.

## Blog posts

- [AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks](https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing) - A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
- [Why We Built AgentClash](https://www.agentclash.dev/blog/why-we-built-agentclash) - Static benchmarks leak. Leaderboards reward hype. We built something different.

## Getting Started

- [Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - Use the hosted backend and validate auth, workspace access, and run creation.
- [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host) - Bring up the local stack with Postgres, Temporal, API server, worker, and web app.
- [First Eval](https://www.agentclash.dev/docs-md/getting-started/first-eval) - Walk through the current happy path from seeded data to live run events and ranking output.

## Concepts

- [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals) - Understand the difference between a run, a ranked result set, and the broader eval concept.
- [Agents and Deployments](https://www.agentclash.dev/docs-md/concepts/agents-and-deployments) - See how runnable agent targets are modeled before they can participate in an eval.
- [Challenge Packs and Inputs](https://www.agentclash.dev/docs-md/concepts/challenge-packs-and-inputs) - Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
- [Replay and Scorecards](https://www.agentclash.dev/docs-md/concepts/replay-and-scorecards) - Learn how canonical events become timelines, evidence, and comparison-ready outputs.
- [Tools, Network, and Secrets](https://www.agentclash.dev/docs-md/concepts/tools-network-and-secrets) - See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
- [Artifacts](https://www.agentclash.dev/docs-md/concepts/artifacts) - Understand stored files, pack assets, run evidence, and signed downloads.

## Challenge packs

- [Reference overview](https://www.agentclash.dev/docs-md/challenge-packs) - Map of every challenge-pack documentation page and where each topic is enforced in Go.
- [Bundle YAML reference](https://www.agentclash.dev/docs-md/challenge-packs/bundle-yaml-reference) - Top-level bundle keys, manifests, and constraints for prompt_eval versus native.
- [Evaluation spec](https://www.agentclash.dev/docs-md/challenge-packs/evaluation-spec-reference) - Validators, targets, metric collectors, scorecard dimensions, strategies, and post-execution captures.
- [LLM judges](https://www.agentclash.dev/docs-md/challenge-packs/llm-judges) - Rubric, assertion, n_wise, and reference modes, plus consensus keys and budgets.
- [Tools, primitives & policy](https://www.agentclash.dev/docs-md/challenge-packs/tools-primitives-and-policy) - allowed_tool_kinds, built-in primitives, and composed tools, down to http_request mocks and cycles.
- [Sandbox & E2B](https://www.agentclash.dev/docs-md/challenge-packs/sandbox-and-e2b) - Pack sandbox block, outbound network CIDR lists, sandbox provider env, and no-op modes.
- [Input sets & cases](https://www.agentclash.dev/docs-md/challenge-packs/input-sets-and-cases) - Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
- [Eval workflows & gates](https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates) - CLI eval start, baseline scorecard compare, gates, and regression scope flags, grounded in Cobra.

## Guides

- [Write a Challenge Pack](https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack) - Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
- [Configure Runtime Resources](https://www.agentclash.dev/docs-md/guides/configure-runtime-resources) - Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
- [Interpret Results](https://www.agentclash.dev/docs-md/guides/interpret-results) - Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
- [CI/CD Agent Gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates) - Define the agent revision, workload, baseline, and release gate a pull request should run.
- [CI/CD Workload Recipes](https://www.agentclash.dev/docs-md/guides/ci-cd-workload-recipes) - Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
- [Use with AI Tools](https://www.agentclash.dev/docs-md/guides/use-with-ai-tools) - Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.

## Agent Skills

- [Skill Catalog](https://www.agentclash.dev/docs-md/agent-skills) - Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
- [Challenge Pack Skills](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills) - Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
- [Agent Build Skills](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills) - Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
- [CLI Setup Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup) - Configure the CLI, authenticate, select workspaces, and run doctor checks.
- [Eval Runner Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner) - Start, follow, and report AgentClash evals and runs with useful evidence.
- [Scorecard Reader Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader) - Turn rankings, scorecards, and replay evidence into engineering findings.
- [Regression Flywheel Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel) - Promote useful run failures into regression suites and verify suite-only runs.
- [CI Release Gate Skill](https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate) - Compare candidates against baselines and wire AgentClash gates into CI.

## Reference

- [CLI](https://www.agentclash.dev/docs-md/reference/cli) - Commands, flags, and command groups generated from the Cobra source tree.
- [Config](https://www.agentclash.dev/docs-md/reference/config) - Current environment surface pulled from the API, worker, CLI, and example config sources.

## Architecture

- [Overview](https://www.agentclash.dev/docs-md/architecture/overview) - Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
- [Orchestration](https://www.agentclash.dev/docs-md/architecture/orchestration) - How API requests become Temporal workflows and how the worker executes them.
- [Sandbox Layer](https://www.agentclash.dev/docs-md/architecture/sandbox-layer) - Why execution is isolated behind a provider boundary and how E2B fits today.
- [Data Model](https://www.agentclash.dev/docs-md/architecture/data-model) - The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
- [Evidence Loop](https://www.agentclash.dev/docs-md/architecture/evidence-loop) - How run events, artifacts, and scorecards move from execution into replay and review.
- [Frontend](https://www.agentclash.dev/docs-md/architecture/frontend) - How the Next.js app is split between public product pages, authenticated app routes, and docs.

## Contributing

- [Setup](https://www.agentclash.dev/docs-md/contributing/setup) - Clone the repo, boot the local stack, and choose the fastest dev loop for your task.
- [Codebase Tour](https://www.agentclash.dev/docs-md/contributing/codebase-tour) - Map the top-level modules before you start changing APIs, workflows, or the web app.
- [Testing](https://www.agentclash.dev/docs-md/contributing/testing) - Pick the smallest useful validation loop and use review checkpoints for scoped changes.
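Every docs page in this index is also available as a page-level markdown export under `/docs-md/...`, alongside the `/llms-full.txt` bundle. A minimal sketch of fetching them programmatically, using only the URL shapes shown in this index (the `docs_md_url` helper is illustrative, not part of AgentClash):

```python
from urllib.request import urlopen  # needed only if you uncomment the fetch

BASE = "https://www.agentclash.dev"

def docs_md_url(path: str) -> str:
    """Build a page-level markdown export URL from a docs path in this index."""
    return f"{BASE}/docs-md/{path.strip('/')}"

quickstart = docs_md_url("getting-started/quickstart")
bundle = f"{BASE}/llms-full.txt"  # the full bundled corpus in one file

# Fetch when online, e.g.:
# text = urlopen(quickstart).read().decode("utf-8")
```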
## Agent Skill Pages

- [agentclash-agent-build-author](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author) - Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.
- [agentclash-agent-deployment-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup) - Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.
- [agentclash-runtime-resources-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup) - Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.
- [agentclash-ci-release-gate](https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate) - Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.
- [agentclash-cli-setup](https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup) - Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.
- [agentclash-eval-runner](https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner) - Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
- [agentclash-regression-flywheel](https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel) - Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.
- [agentclash-scorecard-reader](https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader) - Use when turning AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.
- [agentclash-challenge-pack-artifacts](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts) - Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.
- [agentclash-challenge-pack-input-sets](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets) - Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
- [agentclash-challenge-pack-llm-judges](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges) - Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
- [agentclash-challenge-pack-planner](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner) - Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
- [agentclash-challenge-pack-scoring-validators](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators) - Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
- [agentclash-challenge-pack-tools-sandbox](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox) - Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.
- [agentclash-challenge-pack-validation-publish](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish) - Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.
- [agentclash-challenge-pack-yaml-author](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author) - Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
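The agentclash-ci-release-gate page above describes running `agentclash ci run` in GitHub Actions and acting on gate exit codes. A hedged sketch of that pattern, assuming only that a nonzero exit code means the gate failed; any extra flags (baseline selection, manifest path) are not documented in this index, so the bare command line here is an assumption — consult the CLI reference for the real surface:

```python
import subprocess
import sys

def run_gate(cmd: list[str]) -> int:
    """Run a CI gate command and return its exit code; nonzero fails the job."""
    return subprocess.run(cmd).returncode

def main() -> None:
    # `agentclash ci run` is the command named by the ci-release-gate skill;
    # propagating its exit code makes the CI job pass or fail with the gate.
    sys.exit(run_gate(["agentclash", "ci", "run"]))
```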