# AgentClash Docs Bundle

Canonical docs home: https://www.agentclash.dev/docs
Machine-readable index: https://www.agentclash.dev/llms.txt

This file concatenates the currently shipped AgentClash docs pages and selected product page links into one markdown-oriented bundle for assistants, coding agents, and local retrieval pipelines.

## Public product pages

- [AI Agent Evaluation Platform](https://www.agentclash.dev/platform/agent-evaluation) - Public page for real-task AI agent evaluation, replay evidence, scorecards, challenge packs, and CI regression gates.
- [AI Agent Regression Testing](https://www.agentclash.dev/platform/agent-regression-testing) - Public page for baseline-versus-candidate agent regression testing, pull request gates, and release evidence.

## Blog posts

- [AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks](https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing) - A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
- [Why We Built AgentClash](https://www.agentclash.dev/blog/why-we-built-agentclash) - Static benchmarks leak. Leaderboards reward hype. We built something different.

---

# AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks

A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.

Source: https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing
Published: 2026-05-07
Author: Atharva

Most AI agent evaluation starts in the wrong place.

A team tries a few prompts, compares a model leaderboard, watches one impressive demo, and ships the agent that looked best in a narrow test. Then the agent reaches a real workflow: messy tools, missing context, timeouts, partial files, stale APIs, and users who expect the whole task to finish.

That is where benchmark-only evaluation breaks down. Agents are not just text generators. They plan, call tools, modify state, inspect results, recover from mistakes, and decide when to stop. If the eval only checks the final answer, it misses the behavior that makes an agent safe or expensive to run in production.

Real AI agent evaluation needs regression testing.

## What an agent eval should prove

An agent eval should answer a practical release question: is this agent ready to do this job again, under the same constraints, without getting worse?

That means the eval needs more than a score. It needs a repeatable workload, a fair comparison, and enough evidence for a reviewer to understand the result. A useful AI agent evaluation platform should capture:

- the task definition and inputs
- the tool and network policy
- the agent's actions and observations
- produced files, logs, and artifacts
- correctness, cost, latency, and evidence quality
- the comparison between a candidate and a baseline

That is the difference between "the model looked good" and "this agent passed the release gate."

AgentClash is built around that second workflow. The [AI agent evaluation platform](https://www.agentclash.dev/platform/agent-evaluation) page explains the product surface, but the core idea is simple: run agents on the same real task with the same tools, then preserve replay evidence and scorecards so the result is reviewable.

## Why static benchmarks are not enough

Static benchmarks are useful for a first filter. They are not enough for shipping agents.

They usually measure isolated answers, not trajectories. They rarely include your private tools, your repository shape, your data contracts, your latency budget, or your failure modes. They can also hide the most important production question: did the agent solve the task in a way your team can trust and repeat?

For agents, the path matters.

Two agents can produce the same final answer while behaving very differently. One might use the right tool, verify its work, attach the required artifact, and stay inside budget. Another might hallucinate a file, skip the failing test, and still land near the correct prose answer. A final-answer-only benchmark treats those runs as similar. A real agent eval should not.

## Turn failures into challenge packs

The repeatable unit in AgentClash is a challenge pack: a workload definition with cases, inputs, tools, scoring rules, and artifacts. Challenge packs make agent evaluation operational because they turn a vague question into something runnable:

- What task should the agent perform?
- What inputs and fixtures should it see?
- Which tools are allowed?
- What evidence should be captured?
- Which validators or judges decide success?

When an agent fails in production or in a release test, the failure should become a reusable case. That is how the eval suite compounds. Instead of debugging the same mistake every few weeks, you promote it into coverage and make the next candidate prove it did not regress.

The docs for [writing a challenge pack](https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack) are the right starting point if you want to turn a real workflow into a durable eval.

## Add regression gates to CI

The strongest agent eval is not a dashboard someone remembers to check. It is a gate in the release loop.

AI agent regression testing compares a candidate run against a known baseline. If the candidate gets worse on correctness, cost, latency, artifacts, or another scorecard dimension, the gate can block the pull request before the change reaches users.

That matters because agent quality can regress in subtle ways:

- a prompt edit improves one demo but breaks another workflow
- a model switch changes tool strategy or latency
- a sandbox image update changes installed dependencies
- a retrieval change gives the agent stale or incomplete context
- a tool permission change makes a previously solved task impossible

The [AI agent regression testing](https://www.agentclash.dev/platform/agent-regression-testing) page covers the product angle. The [CI/CD agent gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates) guide covers the implementation path.

## What to look for in an agent evaluation tool

If you are comparing agent evaluation tools, look past the leaderboard. The tool should help your team make a release decision, debug failures, and improve the next test suite.

Useful capabilities include:

- real-task execution instead of prompt-only grading
- sandboxed runs with explicit tool and network policy
- replay timelines for tool calls and observations
- scorecards that separate correctness, cost, latency, and evidence
- artifact capture for files, logs, and outputs
- baseline versus candidate comparison
- CI gates for regressions
- a workflow for promoting failures into reusable tests

The goal is not to collect more numbers. The goal is to shorten the path from "this agent failed" to "we understand why, we fixed it, and the failure is now covered."

## The release loop

The loop should look like this:

1. Capture a real task as a challenge pack.
2. Run candidate and baseline agents under the same constraints.
3. Inspect replay evidence and scorecards.
4. Promote important failures into regression cases.
5. Gate future changes in CI.

That is how AI agent evaluation becomes engineering infrastructure instead of a one-off experiment.

Benchmarks can tell you where to look. Regression testing tells you whether the agent is safe to ship again.

---

# Why We Built AgentClash

Static benchmarks leak. Leaderboards reward hype. We built something different.

Source: https://www.agentclash.dev/blog/why-we-built-agentclash
Published: 2026-03-23
Author: Atharva

Your benchmarks are lying to you.

Every team picking an AI model today is doing the same thing: reading someone else's leaderboard, running a few prompts in a playground, and shipping based on vibes. The benchmarks are gamed. The leaderboards reward hype. And you're left guessing.

We built AgentClash because we were tired of this.

## The problem

Static test sets leak into training data. Crowd-voted rankings measure popularity, not capability. You test agents in isolation, one at a time, and compare scores that were generated under completely different conditions.

None of this tells you which model is actually better **for your task**.

## What we're building

AgentClash puts your models on the same real task, at the same time. Same tools, same constraints, same environment. Scored live on completion, speed, token efficiency, and tool strategy.

Step-by-step replays show exactly why one agent won and another didn't. Every failure gets captured, classified, and turned into a regression test — automatically. The more you run, the smarter your eval suite gets.

## Why opensource

Because eval infrastructure shouldn't be a black box. You should be able to see exactly how models are scored, modify the scoring to fit your use case, and run it on your own infra.

We're building this in the open. Every commit is public. Every design decision is documented.

## What's next

We're in private beta. If you're shipping agents and you're tired of guessing which model to use, [join the waitlist](https://agentclash.dev).

Follow the build on [GitHub](https://github.com/agentclash/agentclash).

---

# AgentClash Documentation

Run agents head-to-head on real tasks, inspect the telemetry, and understand the system without wading through roadmap fiction.

Source: https://www.agentclash.dev/docs
Markdown export: https://www.agentclash.dev/docs-md

AgentClash runs agents against the same task, with the same tools and time budget, then shows you who finished, who stalled, and where the run broke.

These docs are layered for three kinds of readers:

- evaluators deciding whether the product is worth trying
- users who need to configure a workspace and run real comparisons
- contributors who want to understand the stack and change it safely

The current public surface is still early. This docs pass only covers behavior that is already visible in the repo today: the CLI, the local stack, the current run model, and the main runtime components.

Start with the hosted quickstart if you want the shortest path to a real command sequence. Start with self-host if you want the full local stack on your machine. Start with architecture if you are here to hack on the code. For **challenge pack YAML, scoring, tooling, sandboxes, judges, and CLI eval flows**, start at [Challenge pack reference](https://www.agentclash.dev/docs-md/challenge-packs).

---

# Hosted Quickstart

Validate the CLI against the hosted production backend, set a workspace, and get to your first runnable command in a few minutes.

Source: https://www.agentclash.dev/docs/getting-started/quickstart
Markdown export: https://www.agentclash.dev/docs-md/getting-started/quickstart

This path is for people changing the CLI or trying the product without booting the whole stack locally.

> Note: The hosted quickstart assumes your workspace already has challenge packs and
> deployments. If it does not, stop after `link` and then author a pack with
> `challenge-pack init`; you have still verified auth, connectivity, and
> workspace selection.

## 1. Install the CLI

```bash
npm i -g agentclash
```

## 2. Point the CLI at production and log in

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth login --device
```

Use `--device` when you are in a remote shell or do not want the CLI to open a browser automatically.

## 3. Link a workspace

```bash
agentclash link
```

The CLI resolves the API base URL in this order:

```text
--api-url > AGENTCLASH_API_URL > saved user config > http://localhost:8080
```

`agentclash link` saves the selected workspace in user config so later commands do not need raw IDs by default.

## 4. Choose your next path

```bash
agentclash doctor
agentclash eval start --help
```

If the workspace is already seeded with challenge packs and agent deployments, create and follow a run:

```bash
agentclash eval start --follow
```

If the workspace is empty, scaffold a starter pack first:

```bash
agentclash challenge-pack init support-eval.yaml
agentclash challenge-pack validate support-eval.yaml
agentclash challenge-pack publish support-eval.yaml
agentclash eval start --follow
agentclash baseline set
agentclash eval scorecard
```

## Verification

You should now have:

- a valid CLI login
- a default workspace linked locally
- a working connection to the hosted API
- either a created run or enough context to see what the workspace is missing
- a clear next step: publish a challenge pack, start an eval, or save a baseline

## See also

- [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host)
- [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals)
- [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli)

---

# Self-Host Starter

Bring up the local AgentClash stack with the repo’s existing scripts and understand which dependencies are mandatory versus optional.

Source: https://www.agentclash.dev/docs/getting-started/self-host
Markdown export: https://www.agentclash.dev/docs-md/getting-started/self-host

This is the shortest honest path to a local AgentClash environment today. It is based on the repo’s existing development scripts, not an imagined one-click installer.

> Warning: The repo does not currently ship a Helm chart or a polished production
> installer. What it does ship is a local stack script plus documented Railway
> deployment building blocks for the backend.

## Prerequisites

- Go `1.25+`
- Docker
- Temporal CLI
- Node.js `20+`
- `pnpm`
- `psql`

## 1. Start the local stack

From the repo root:

```bash
./scripts/dev/start-local-stack.sh
```

This script starts PostgreSQL and Redis, applies migrations, launches the Temporal dev server if needed, then starts the API server and worker. Logs are written under `/tmp/agentclash-local-stack/`.

## 2. Start the web app

```bash
cd web
pnpm install
pnpm dev
```

The web app runs at `http://localhost:3000`.

## 3. Seed a runnable fixture

Back in the repo root:

```bash
./scripts/dev/seed-local-run-fixture.sh
./scripts/dev/curl-create-run.sh
```

Without a real sandbox provider such as E2B, native runs can still be created, but the model-backed execution path will not complete successfully.

## Required vs optional services

- Required: PostgreSQL, Temporal, API server, worker
- Optional: Redis for event fanout and rate limiting
- Optional: E2B for sandboxed native execution
- Optional: S3-compatible storage for production artifact storage

## Production notes

The repo’s documented production building blocks today are:

- Railway for the API server and worker
- Temporal Cloud for orchestration
- Vercel for `web/`
- S3-compatible storage for artifacts

## Verification

You should be able to hit:

```bash
curl http://localhost:8080/healthz
```

Then open `http://localhost:3000`.

## See also

- [Hosted Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart)
- [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview)
- [Contributor Setup](https://www.agentclash.dev/docs-md/contributing/setup)

---

# First Eval Walkthrough

Use the current seeded local path to create a run, stream events, and inspect ranking output without inventing setup that is not in the repo.

Source: https://www.agentclash.dev/docs/getting-started/first-eval
Markdown export: https://www.agentclash.dev/docs-md/getting-started/first-eval

This walkthrough sticks to what the repo already supports today: seed local data, create a run, stream events, and inspect the result.

## 1. Bring up the local stack

From the repo root:

```bash
./scripts/dev/start-local-stack.sh
```

If you want the browser UI too:

```bash
cd web
pnpm install
pnpm dev
```

## 2. Seed a runnable fixture

Back in the repo root:

```bash
./scripts/dev/seed-local-run-fixture.sh
```

That script seeds enough data to create a local run through the API.

## 3. Create the run

You can hit the API directly:

```bash
./scripts/dev/curl-create-run.sh
```

Or, if you are using the CLI against a prepared workspace, create and follow the run there:

```bash
agentclash eval start --follow
```

`eval start` is the workflow-first wrapper around `run create` — it resolves
challenge packs, versions, input sets, and deployments by name or interactive
selection. Use `agentclash run create` directly when you want to pass IDs
explicitly (CI scripts, automation).

## 4. Inspect the result

Once you have a run ID, inspect its status and ranking:

```bash
agentclash run get <RUN_ID>
agentclash run ranking <RUN_ID>
```

If the web app is running, open the workspace run detail view in the browser and inspect the replay and scorecard surfaces from there.

## What you should see

- a run record created in the workspace
- event streaming during execution when you follow the run
- a ranking view once the backend has enough completed run-agent results to score

> Warning: Without a real sandbox provider such as E2B, the native model-backed path can
> still stall or fail after run creation. That is expected in the unconfigured
> local setup.

## See also

- [Self-Host Starter](https://www.agentclash.dev/docs-md/getting-started/self-host)
- [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals)
- [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview)

---

# Runs and Evals

The product language around runs and evals is easy to blur. The current codebase makes one distinction especially important.

Source: https://www.agentclash.dev/docs/concepts/runs-and-evals
Markdown export: https://www.agentclash.dev/docs-md/concepts/runs-and-evals

A **run** is the concrete execution object you create, stream, rank, compare, and inspect in AgentClash today.

In the current user-facing product surface, `run` is the first-class noun:

- `agentclash run create`
- `agentclash run list`
- `agentclash run ranking`
- `agentclash compare gate --baseline <RUN_ID> --candidate <RUN_ID>`

The workflow-first surface (`agentclash eval start`, `agentclash baseline set`,
`agentclash eval scorecard`) wraps these resource commands with name-based
selectors and a bookmarked baseline so day-to-day evaluation does not require
juggling raw run IDs. The resource commands above remain the canonical
ID-centric path for CI and automation.

A run is not just one model token stream. It is the container for a scored evaluation attempt inside a workspace, including the challenge pack version, selected agent deployments, lifecycle timestamps, and ranking output.

The word **eval** is broader. People use it to mean “the experiment I am trying to run” or “the graded set of results I care about.” That is reasonable, but if you are reading the code or the CLI, you should anchor on this:

- **Run** = the concrete resource you create and query.
- **Eval** = the broader exercise or outcome you are trying to measure.

There are also places in the codebase that refer to eval sessions, but the main shipped workflow today still revolves around runs and ranked run results. If you keep that in your head, the CLI and API are much easier to follow.

## Practical rule of thumb

Use **run** when you are talking about a real resource ID. Use **eval** when you are talking about the experiment design or the larger testing loop.

## See also

- [Hosted Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart)
- [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli)

---

# Agents and Deployments

Understand how AgentClash turns a build plus runtime/provider resources into a concrete deployment that can be scheduled into a run.

Source: https://www.agentclash.dev/docs/concepts/agents-and-deployments
Markdown export: https://www.agentclash.dev/docs-md/concepts/agents-and-deployments

A deployment is the workspace-scoped runnable target that AgentClash can attach to a run.

## Why a deployment exists at all

AgentClash is stricter than a typical playground because it has to compare like with like. A model name by itself is not enough. The scheduler needs a concrete object that says:

- which build is being run
- which build version is current
- which runtime policy applies
- which provider credentials or model mapping are attached

That concrete object is the deployment.

## The current creation contract

The current API schema for `CreateAgentDeploymentRequest` requires:

- `name`
- `agent_build_id`
- `build_version_id`
- `runtime_profile_id`

It also supports these optional fields:

- `provider_account_id`
- `model_alias_id`
- `deployment_config`

The OpenAPI description also says only ready build versions can be deployed.

<DiagramAgentsToRun />

## Runtime profiles are the execution envelope

A runtime profile defines how aggressive or constrained execution should be. In the current API and web types, a runtime profile carries fields like:

- `execution_target`
- `trace_mode`
- `max_iterations`
- `max_tool_calls`
- `step_timeout_seconds`
- `run_timeout_seconds`
- `profile_config`

That last field matters. The native executor reads runtime-profile sandbox overrides from `profile_config`, including things like filesystem roots and `allow_shell` or `allow_network` toggles.

The clean mental model is:

- the challenge pack defines what the workload wants
- the runtime profile defines execution ceilings and local overrides
- the deployment binds those choices to a runnable target

## Provider accounts are how credentials enter the system

A provider account is a workspace resource with:

- `provider_key`
- `name`
- `credential_reference`
- optional `limits_config`

The important detail is how credentials are stored.

If you create a provider account with a raw `api_key`, the infrastructure manager stores that value as a workspace secret and rewrites the credential reference automatically to:

```text
workspace-secret://PROVIDER_<PROVIDER_KEY>_API_KEY
```

So the product already prefers indirection over plaintext credentials on the resource itself.

## Model aliases are not just display sugar

The user question usually comes out as “provider alias” or “model alias.” In the current product surface, the real resource is `model alias`.

A model alias maps a workspace-friendly key to a model catalog entry, and can optionally be tied to a provider account. The current fields are:

- `alias_key`
- `display_name`
- `model_catalog_entry_id`
- optional `provider_account_id`

That gives you a stable name inside the workspace even if the underlying provider model identifier is ugly or if you need multiple account-specific mappings.

## A deployment is where these pieces come together

A good way to think about the chain is:

- agent build version: what logic is being deployed
- runtime profile: how it is allowed to execute
- provider account: which credentials or spend limits back external model calls
- model alias: which model selection the deployment should use consistently
- deployment: the runnable handle used by runs

This is why the docs should not collapse deployment into “selected model.” The object is richer than that.

## What the UI and CLI expose today

The current repo already exposes the resource model across multiple surfaces:

- CLI `deployment create` and `deployment list`
- workspace pages for runtime profiles, provider accounts, model aliases, deployments, secrets, and tools
- run creation UI that asks for challenge pack and deployment selection separately

That separation is deliberate. A run is an execution event. A deployment is reusable infrastructure state.

## What is stable versus still moving

The stable part is the dependency chain and the API surface. The still-moving part is how richly each resource is edited in the UI and how much automation exists around them.

So the right docs posture is:

- document the current fields and flows precisely
- avoid pretending the deployment UX is fully polished
- treat the resource model itself as real and important

## See also

- [Configure Runtime Resources](../guides/configure-runtime-resources)
- [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs)
- [Tools, Network, and Secrets](../concepts/tools-network-and-secrets)
- [CLI Reference](../reference/cli)

---

# Challenge Packs and Inputs

Learn what a challenge pack really is in AgentClash, how the bundle is structured, and how inputs become runnable cases.

Source: https://www.agentclash.dev/docs/concepts/challenge-packs-and-inputs
Markdown export: https://www.agentclash.dev/docs-md/concepts/challenge-packs-and-inputs

A challenge pack is a versioned YAML bundle that defines the workload, scoring contract, execution policy, and input sets for a repeatable evaluation.

For field-by-field YAML, scoring enums, judge modes, primitives, sandbox flags, and CLI eval flows, use the **[challenge pack reference hub](../challenge-packs)**—it is written for pack authors who need repo-accurate depth, not a marketing overview.

## What makes it a challenge pack instead of a prompt

A challenge pack is not just a task description. In the current repo, a runnable pack carries enough structure for AgentClash to do four jobs consistently:

- execute the same workload again later
- attach one or more deployments to that workload
- score the result using a versioned evaluation spec
- preserve the relationship between a failed case and the evidence that exposed it

That is why the API does not ask you to start a run with a loose prompt blob. It asks for a `challenge_pack_version_id`.

## The current bundle shape

The parser in `backend/internal/challengepack/bundle.go` expects a YAML bundle with these top-level sections:

- `pack`: human metadata like `slug`, `name`, and `family`
- `version`: the executable version block
- `tools`: optional pack-defined composed tools
- `challenges`: the workload definitions
- `input_sets`: the concrete runnable cases

A pack becomes runnable through its `version` block. That block currently carries the load-bearing execution data:

- `number`: the pack version number
- `execution_mode`: `native` or `prompt_eval`
- `tool_policy`: allowed tool kinds and runtime toggles
- `filesystem`: optional filesystem constraints
- `sandbox`: network, env, package, and template configuration
- `evaluation_spec`: the scoring contract
- `assets`: version-scoped files or artifact references

<DiagramChallengePackBundleShape />

## Challenge, input set, case, and asset are different things

These terms are easy to blur together. Do not blur them.

- challenge pack: the entire versioned bundle
- challenge: one task definition inside the bundle
- input set: one named collection of runnable cases for that pack version
- case: one concrete workload item tied to a challenge via `challenge_key`
- asset: a file-like dependency declared by key and path, optionally backed by a stored artifact ID

The bundle model in the repo uses `input_sets[].cases[]` as the main execution unit. A case can carry:

- `payload`
- structured `inputs`
- structured `expectations`
- `artifacts`
- case-local `assets`

That makes cases more expressive than a single flat prompt. They can reference files, expected outputs, and evaluator inputs without inventing an ad-hoc schema per benchmark.

## The evaluation spec is part of the pack, not global product config

The current evaluation docs are explicit about this. The scoring contract lives inside the pack version’s manifest. That means the pack defines:

- validator keys and types
- metrics and collectors
- runtime limits
- pricing rows used for cost scoring
- scorecard dimensions and normalization thresholds

This matters because AgentClash needs scorecards to remain auditable. When a run is scored, the product can persist the exact `evaluation_spec_id` that was used. The publish response already returns that ID.

## Execution mode matters

Two execution modes are visible in the current code and examples:

- `prompt_eval`: lighter-weight packs that focus on prompt-style evaluation
- `native`: packs that can carry sandbox, tool, and execution policy for richer runs

You should choose the simpler mode unless the workload really needs a sandbox, files, or tool execution.

## Sandbox, tool policy, and internet access belong to the pack version

This is one of the most important design choices in the repo.

The pack version can say what the evaluator is allowed to do:

- which tool kinds are allowed
- whether shell or network access is enabled
- what network CIDRs are allowed
- which additional packages should exist in the sandbox
- which env vars are injected as literal values

In other words, the pack is not only content. It is also policy.

## Assets and artifact-backed packs

The version block, challenge blocks, and case blocks can all reference assets. Each asset has a `key` and `path`, and may also carry `media_type`, `kind`, or `artifact_id`.

That gives you two useful authoring patterns:

- check small fixtures into the pack and refer to them by path
- attach previously uploaded workspace artifacts and refer to them by `artifact_id`

Validation already checks that asset references are real. If a case or expectation points at an artifact key that was never declared, publish-time validation fails.

## Publish and validate are first-class workflow steps

The API and CLI already expose the authoring loop directly:

- validate with `POST /v1/workspaces/{workspaceID}/challenge-packs/validate`
- publish with `POST /v1/workspaces/{workspaceID}/challenge-packs`
- list packs with `GET /v1/workspaces/{workspaceID}/challenge-packs`
- list published input sets with `GET /v1/workspaces/{workspaceID}/challenge-pack-versions/{versionID}/input-sets`

The CLI mirrors that with:

```bash
agentclash challenge-pack validate <file>
agentclash challenge-pack publish <file>
agentclash challenge-pack list
```

Publish returns more than a pack ID. It returns:

- `challenge_pack_id`
- `challenge_pack_version_id`
- `evaluation_spec_id`
- `input_set_ids`
- optional `bundle_artifact_id`

That tells you the pack bundle is treated as a concrete artifact of record, not just transient YAML.

## How a pack becomes a run

A run binds:

- one `challenge_pack_version_id`
- one or more deployment IDs
- optionally one selected input set

From there the worker resolves the pack manifest, execution policy, assets, and scoring contract into the runtime path.

That is why challenge packs are foundational in AgentClash. They are the unit that makes two runs comparable without hand-waving.

## See also

- [Write a Challenge Pack](../guides/write-a-challenge-pack)
- [Agents and Deployments](../concepts/agents-and-deployments)
- [Tools, Network, and Secrets](../concepts/tools-network-and-secrets)
- [Artifacts](../concepts/artifacts)

---

# Replay and Scorecards

Understand how run events become a readable timeline, a defensible score, and a reusable evidence trail.

Source: https://www.agentclash.dev/docs/concepts/replay-and-scorecards
Markdown export: https://www.agentclash.dev/docs-md/concepts/replay-and-scorecards

Replay is the ordered event history of a run. Scorecards are the condensed judgments and summaries built from that evidence.

## Why AgentClash stores both

If you only keep a final score, you lose the explanation. If you only keep raw logs, nobody can compare anything quickly. AgentClash needs both because the product is about arguing from evidence, not from vibes.

The canonical event envelope work in the repo makes that boundary explicit. Execution emits structured events. The frontend and downstream analysis layers can replay those events as a timeline. Scorecards then turn the same evidence into something compact enough to rank, filter, and compare.

## Replay is the source of truth

Think of replay as the forensic record. It answers questions like:

- what happened first
- when the agent called a tool
- when the sandbox or infrastructure layer failed
- when artifacts or outputs were produced
- what the final terminal state was

That is why replay data should be preserved even when the top-line score looks obvious. The run may still teach you something the score alone cannot show.

## Scorecards are the decision layer

A scorecard should make a run legible in seconds. The exact schema will keep evolving, but the purpose is stable:

- summarize whether the run passed, failed, or degraded
- attach the evidence that justifies that judgment
- make comparisons across runs possible without rereading the full trace

<DiagramReplayVsScorecards />

## How to use both together

The fastest useful workflow is:

1. start with the scorecard to see whether the run is healthy
2. move to the replay timeline to understand why
3. inspect artifacts when the failure is ambiguous or multi-step
4. compare against another run only after you trust the evidence on each side

That sequence sounds basic, but it prevents a common failure mode: overreacting to a single score change without checking whether the underlying run actually exercised the same path.

## See also

- [Interpret Results](../guides/interpret-results)
- [Evidence Loop](../architecture/evidence-loop)
- [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs)
- [Data Model](../architecture/data-model)

---

# Tools, Network, and Secrets

Learn the difference between workspace tools, pack-defined tools, and engine primitives, and how network and secret handling are constrained.

Source: https://www.agentclash.dev/docs/concepts/tools-network-and-secrets
Markdown export: https://www.agentclash.dev/docs-md/concepts/tools-network-and-secrets

AgentClash has more than one “tool” layer. If you do not separate them mentally, the rest of the runtime model gets confusing fast.

## There are three different layers to know

### 1. Workspace tool resources

The workspace API already exposes first-class `tools` resources with fields like:

- `name`
- `tool_kind`
- `capability_key`
- `definition`
- `lifecycle_status`

These are infrastructure resources that live alongside runtime profiles, provider accounts, and model aliases.

### 2. Pack-defined composed tools

Inside a challenge pack, the optional top-level `tools` block lets a pack author define custom tool interfaces that the evaluated agent can see.

Those definitions are pack-local. They are part of the authored benchmark bundle.

### 3. Engine primitives

At the bottom are the built-in executor primitives, like `http_request`. These are the concrete operations the runtime knows how to execute safely.

A pack-defined tool can delegate to a primitive. That is the key distinction.

## Primitive versus composed tool

The current validation code expects composed tools to look roughly like this:

```yaml
tools:
  custom:
    - name: check_inventory
      description: Check inventory by SKU
      parameters:
        type: object
        properties:
          sku:
            type: string
      implementation:
        primitive: http_request
        args:
          method: GET
          url: https://api.example.com/inventory/${sku}
          headers:
            Authorization: Bearer ${secrets.INVENTORY_API_KEY}
```

What this means:

- `check_inventory` is the tool name the agent sees
- `http_request` is the engine primitive that actually runs
- `args` is the templated mapping from tool parameters to primitive inputs

So when people ask “primitive tools vs actual tools,” the clean answer is:

- primitives are built-in executor operations
- composed tools are the author-defined tool contracts that delegate to those primitives
- workspace tool resources are a separate infrastructure surface

## Validation is strict on purpose

The current parser and tests already reject several dangerous or ambiguous cases:

- unknown template placeholders like `${missing}`
- self-referencing tools where a tool delegates to itself
- delegation cycles across composed tools
- invalid JSON-schema parameter definitions
- missing primitive names or missing args blocks

That strictness is good. A benchmark bundle should fail at publish time rather than fail mysteriously at run time.

## Tool kinds are a separate gate from tool names

The sandbox policy also carries `allowed_tool_kinds`. That means the pack can say which broad categories are available, for example:

- `file`
- `shell`
- `network`

This is different from a specific composed-tool name. A pack might define `check_inventory`, but the runtime still checks whether the underlying kind is allowed.

## Internet access is not automatic

The current runtime does not treat network as free ambient capability.

There are at least three control points visible in the repo:

- the sandbox/tool policy starts with network disabled by default
- the pack can enable outbound networking through `sandbox.network_access` and related policy toggles
- the `http_request` primitive validates the target URL and CIDR allowlist before making a request

The current `http_request.py` helper does all of this:

- allows only `http` and `https`
- rejects missing hosts
- resolves DNS and checks resolved addresses
- blocks private, loopback, link-local, reserved, and multicast addresses unless explicitly allowlisted
- enforces request and response body limits
- sanitizes error handling so secret-bearing values do not leak back to the agent

So the current answer to “how can you call the external internet?” is:

- use a tool path that ultimately delegates to a network-capable primitive like `http_request`
- enable network access in the pack/runtime policy
- keep the destination within the permitted network rules

## Secrets live outside the pack

The product already exposes workspace-scoped secrets as a first-class surface.

You can:

- list secret keys
- set a secret value
- delete a secret

The list endpoint intentionally returns metadata only. Secret values never come back out.

The CLI surface is:

```bash
agentclash secret list
agentclash secret set <KEY>
agentclash secret delete <KEY>
```

## Where secret references resolve

There are two distinct secret-reference patterns in the current code:

- `workspace-secret://KEY` for provider credential resolution
- `${secrets.KEY}` inside composed-tool argument templates

These are not interchangeable.

`workspace-secret://KEY` is used when the provider layer resolves account credentials.

`${secrets.KEY}` is used during composed-tool argument substitution. The engine then decides whether the target primitive is allowed to receive secret-bearing args.

## Only hardened primitives can accept `${secrets.*}`

This is a security boundary, not a convenience feature.

The current `primitive_secrets.go` file says only secret-safe primitives may receive `${secrets.*}` substitutions, and today that allowlist intentionally includes only `http_request`.

The reason is straightforward:

- secrets must not end up in argv
- secrets must not land in readable sandbox files
- secrets must not come back in response headers or stderr
- secrets must not be echoed into the agent context accidentally

That is also why sandbox `env_vars` are literal-only. The executor explicitly rejects `${...}` placeholders there, and the code comment tells pack authors to use `http_request` headers instead when remote authentication is needed.

## See also

- [Write a Challenge Pack](../guides/write-a-challenge-pack)
- [Configure Runtime Resources](../guides/configure-runtime-resources)
- [Sandbox Layer](../architecture/sandbox-layer)
- [Artifacts](../concepts/artifacts)

---

# Artifacts

Understand workspace artifacts, pack assets, run evidence files, and how downloads are signed and delivered.

Source: https://www.agentclash.dev/docs/concepts/artifacts
Markdown export: https://www.agentclash.dev/docs-md/concepts/artifacts

An artifact is a stored file object that AgentClash can keep at workspace scope, attach to runs, reference from challenge packs, and expose through signed downloads.

## What an artifact is in the current product

The current artifact response shape already tells you the core model:

- `workspace_id`
- optional `run_id`
- optional `run_agent_id`
- `artifact_type`
- optional `content_type`
- optional `size_bytes`
- optional `checksum_sha256`
- `visibility`
- `metadata`
- `created_at`

That means artifacts are not only run outputs. They can exist before a run and be used as reusable workspace context.

## There are two important artifact roles

### 1. Workspace-managed files

The workspace UI and API let you upload arbitrary files to the workspace artifact store.

The current artifacts page describes them as files you can:

- use as context in challenge packs
- attach to runs

That is the right mental model. Upload once, then reuse where it makes sense.

### 2. Run evidence files

Runs and replay events can also point at artifacts. Those become part of the evidence trail for later inspection, scoring, or failure review.

That is why replay and failure-review models carry artifact references. Artifacts are part of the audit trail, not just incidental attachments.

<DiagramArtifactFlow />

## Challenge-pack assets and artifact refs

Challenge packs do not embed giant blobs directly into YAML. They declare assets and then refer to them by key.

The bundle model supports assets at multiple levels:

- `version.assets`
- `challenge.assets`
- `case.assets`

Each asset can carry:

- `key`
- `path`
- `kind`
- `media_type`
- optional `artifact_id`

Then other parts of the pack can reference those declared assets using:

- `artifact_refs`
- `artifact_key`
- expectation sources like `artifact:<key>`

Validation already checks that those references are real. If the key or artifact ID does not resolve, validation fails before publish.

## The published bundle is itself tracked as an artifact

When you publish a challenge pack, the response may include `bundle_artifact_id`.

That is an important detail because it means the authored pack bundle is treated as a stored object of record. The product does not only store parsed rows; it can also retain the published source bundle as an artifact.

## Upload and download flow

The current API surface is:

- `GET /v1/workspaces/{workspaceID}/artifacts`
- `POST /v1/workspaces/{workspaceID}/artifacts`
- `GET /v1/artifacts/{artifactID}/download`
- public content route at `/artifacts/{artifactID}/content`

The upload path is multipart and supports:

- `file`
- `artifact_type`
- optional `run_id`
- optional `run_agent_id`
- optional `metadata`

The download flow is intentionally indirect. The API returns a signed URL and expiry, then the actual file content is served through the public content endpoint. That keeps raw artifact content behind signed access rather than exposing direct permanent object URLs.

## Visibility and metadata matter

Artifacts also carry `visibility` and arbitrary JSON `metadata`.

The current UI uses metadata to recover nicer names like `original_filename`. If no filename metadata exists, it falls back to showing the artifact ID prefix.

That sounds minor, but it is a sign that metadata is a first-class part of the artifact model, not just a debug dump.

## When to use artifacts versus inline YAML data

Use inline bundle data when:

- the value is small
- it belongs directly in the challenge definition
- you want the pack to stay self-contained

Use artifacts when:

- the file is large or binary
- you want reuse across packs or runs
- the same file should be downloadable later
- the evidence trail should preserve it as a named object

## See also

- [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs)
- [Write a Challenge Pack](../guides/write-a-challenge-pack)
- [Evidence Loop](../architecture/evidence-loop)
- [Data Model](../architecture/data-model)

---

# Challenge pack documentation

Deep, consumer-facing YAML and runtime reference keyed to the parsers, validators, and workers in this repository.

Source: https://www.agentclash.dev/docs/challenge-packs
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs

These pages complement the short concept guide [Challenge packs and inputs](../concepts/challenge-packs-and-inputs). They spell out everything a benchmark author needs to publish a pack that survives server-side parsing, validation, and execution.

Everything here is keyed to shipped code paths—not roadmap language. When behavior changes upstream, validate again with:

```bash
agentclash challenge-pack validate your-pack.yaml
```

## What's covered

| Topic | Use when you… | Anchor in repo |
| --- | --- | --- |
| [Bundle YAML reference](bundle-yaml-reference) | Need the authoritative field list and `prompt_eval` vs `native` rules | `backend/internal/challengepack/bundle.go`, `validation.go` |
| [Evaluation spec reference](evaluation-spec-reference) | Choose validator types, wire `target`/`expected_from`, add metrics | `backend/internal/scoring/spec.go`, `validation.go`, `engine_*.go` |
| [LLM judges](llm-judges) | Add rubrics, assertions, pairwise comparison, budgets | `backend/internal/scoring/spec.go`, `validation_judges.go` |
| [Tools, primitives & policy](tools-primitives-and-policy) | Decide `allowed_tool_kinds`, map composed tools → primitives | `backend/internal/engine/primitive_tools.go`, `tool_registry.go`, `sandbox/sandbox.go` |
| [Sandbox & E2B](sandbox-and-e2b) | Tune network_allowlist, template id, sandbox provider | `backend/internal/challengepack/bundle.go`, `sandbox/e2b/`, worker config |
| [Input sets & cases](input-sets-and-cases) | Model fixtures, typed inputs and expectations | `challengepack/bundle.go` (`CaseDefinition`), `StoredCaseDocument` |
| [Eval workflows & gates](eval-workflows-and-gates) | Chain `eval start`, baselines, scorecards, comparisons | `cli/cmd/eval.go`, `baseline.go`, `compare.go` |

## See also

- [Write a challenge pack](../guides/write-a-challenge-pack) — minimal happy-path checklist
- [Tools, network, and secrets](../concepts/tools-network-and-secrets) — mental model overview
- [Sandbox layer](../architecture/sandbox-layer) — provider boundary explanation

---

# Bundle YAML reference

Structured reference for challenge-pack bundles as parsed by backend/internal/challengepack.

Source: https://www.agentclash.dev/docs/challenge-packs/bundle-yaml-reference
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/bundle-yaml-reference

AgentClash challenge packs ship as **one YAML file** decoded into `challengepack.Bundle` (`backend/internal/challengepack/bundle.go`). Publication stores a JSON manifest combining the pack body with metadata for execution and scoring (`ManifestJSON` in the same file).

## Top-level keys

All keys listed here are persisted or validated somewhere in publish/validate—not decorative.

### `pack` (required)

Human metadata surfaced in API and CLI lists:

- `slug` — required, workspace-unique branding string
- `name` — required display name  
- `family` — required grouping/category string  
- `description` — optional

### `version` (required)

The executable spine of the bundle:

| Field | Notes |
| --- | --- |
| `number` | Required positive integer (`int32`). Bumps when you materially change validators, tooling, sandbox, scores, etc. |
| `execution_mode` | `native` **or** `prompt_eval`. Empty accepts as legacy but you should always set explicitly. See execution rules below. |
| `tool_policy` | Arbitrary-shaped map mirrored into manifest `tool_policy`. **Must be omitted** when `execution_mode` is `prompt_eval` (validated in `ValidateBundle`). |
| `filesystem` | Optional filesystem constraints blob (same manifest path as today). |
| `sandbox` | Optional `SandboxConfig`. **Forbidden** when `execution_mode` is `prompt_eval`. |
| `evaluation_spec` | Required scoring contract unmarshalled through scoring package strict decode (`scoring.StrictDecodeEvaluationSpec`). Typos surface at parse time rather than silently defaulting. |
| `assets` | Version-scoped `AssetReference[]` keyed for cases and uploads. |

`SandboxConfig` (`bundle.go`) currently supports:

- `network_access` (bool)
- `network_allowlist` (CIDR strings; invalid CIDR rejects publish)
- `env_vars` (map of string literals—see [Sandbox & E2B](sandbox-and-e2b))
- `additional_packages` (APT-style names constrained by regexp in validation)
- `sandbox_template_id` (optional provider template override)

Manifest JSON merges `sandbox_template_id` from `version.sandbox` into the serialized `version` block for backends that historically keyed off that field separately.

### `tools` (optional)

Optional map keyed by integration style; authoring today uses **`tools.custom`** as an array of composed tools (`validation.go`). **Must be empty** when mode is `prompt_eval`.

See [Tools, primitives & policy](tools-primitives-and-policy).

### `challenges` (required, non-empty)

Each challenge includes:

| Field | Required | Purpose |
| --- | --- | --- |
| `key` | yes | Stable id referenced by cases |
| `title` | yes | Display |
| `category` | yes | Stored metadata |
| `difficulty` | yes | Stored metadata (`easy`/`medium`/… as free text unless your org standardizes it) |
| `instructions` | often | Prompt body; mirrored into `definition.instructions` if missing there |
| `definition` | optional | Extra JSON-compatible bag for product-specific authoring |
| `assets` | optional | Challenge-scoped files |
| `artifact_refs` | optional | Artifact key references validated against declared artifacts |

### `input_sets` (required)

Defines runnable **cases**:

- **`key`** and **`name`** required on each set.
- Prefer modern **`cases`** array. Legacy `items` aliases are normalized into `cases` during `normalizeBundle`; do not rely on `items` in new authoring.
- All cases in a single input set must reference the same `challenge_key`; use separate input sets for separate challenges.

Cases are modeled by `CaseDefinition` with `challenge_key`, `case_key`, optional rich `inputs`/`expectations`, `artifacts`, `assets`, plus legacy **`payload`** for blob-only authoring.

Deep dive: [Input sets & cases](input-sets-and-cases).

## Execution mode compatibility

Validated in `ValidateBundle`:

When `execution_mode` is **`prompt_eval`**:

- `tools` block must **not** be present
- `version.sandbox` must **not** be present  
- `version.tool_policy` must be **empty**

When **`native`** you may populate sandbox, allowed tool kinds, and composed tools—the worker will hydrate the richer runtime path (`native_executor` flow).

Choose `prompt_eval` for pure model-output workloads; promote to `native` when you rely on sandbox files, primitives, validators that read captured files (`file:` evidence), etc.

## After publish — IDs returned

Publishing returns stable identifiers referenced by runs (see authoring guide):

- `challenge_pack_id`
- `challenge_pack_version_id`
- `evaluation_spec_id`
- `input_set_ids`
- Optional `bundle_artifact_id`

Run creation binds `challenge_pack_version_id` specifically—YAML filenames stop mattering immediately after publish.

## See also

- [Evaluation spec reference](evaluation-spec-reference)
- [Challenge packs and inputs — conceptual overview](../concepts/challenge-packs-and-inputs)

---

# Evaluation spec reference

Validators, metrics, behavioral signals, runtime limits, and scorecard semantics from backend/internal/scoring.

Source: https://www.agentclash.dev/docs/challenge-packs/evaluation-spec-reference
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/evaluation-spec-reference

Pack-local `evaluation_spec` is unmarshalled into `scoring.EvaluationSpec` (`backend/internal/scoring/spec.go`) via strict decoding. This page maps fields to enums and collectors actually implemented—not aspirational placeholders.

For LLM graders see the dedicated guide: [LLM judges](llm-judges).

## Judge mode (`judge_mode`)

Top-level discriminator on how scoring composes deterministic pieces with judges:

| Value | Constant |
| --- | --- |
| `deterministic` | `JudgeModeDeterministic` |
| `llm_judge` | `JudgeModeLLMJudge` |
| `hybrid` | `JudgeModeHybrid` |

Invalid values fail validation early.

## Validators (`validators[]`)

Each entry is a `ValidatorDeclaration`:

- **`key`** — unique within spec; also forbidden to collide with metrics or judges
- **`type`** — enumerated `ValidatorType` (snippet below)
- **`target`** — evidence reference (validators require supported references—see Evidence references section)
- **`expected_from`** — often required depending on validator type (`RequiresExpectedFrom` in `spec.go`)
- **`config`** — type-specific strict JSON validated in `validation.go`

### Implemented validator types

From `ValidatorType*` constants:

`exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`

File-ish validators gate on sandbox artifacts (see **File validators**: `IsFileValidator()` distinguishes these).

Always check `requires_expected_from`: e.g., `file_exists`, `directory_structure`, and `code_execution` can rely on config/paths without `expected_from`.

## Metrics (`metrics[]`)

`MetricDeclaration` requires:

| Field | Notes |
| --- | --- |
| `key` | Unique within spec |
| `type` | `numeric`, `text`, or `boolean` |
| `collector` | Implemented switch in `engine_metrics.go` |
| `unit` | Stored for dashboards/score normalization |

Collectors wired today (verbatim keys):

`run_total_latency_ms`, `run_ttft_ms`, `run_input_tokens`, `run_output_tokens`, `run_total_tokens`, `run_agent_tokens`, `run_race_context_tokens`, `run_model_cost_usd`, `run_completed_successfully`, `run_failure_count`, `run_tool_call_count`, behavioral scores (`behavioral_recovery_score`, … ), `validator_pass_rate`

Declaring a collector that does not exist will fail silently only if evidence missing—prefer copying keys from tests in `backend/internal/scoring/engine_metrics.go`.

## Behavioral panel (`behavioral`)

Optional `behavioral.signals[]` referencing `behavioral.signal` enums:

- `recovery_behavior`
- `exploration_efficiency`
- `error_cascade`
- `scope_adherence`
- `confidence_calibration`

Each signal supports `weight`, optional `gate`, `pass_threshold` for hardened evaluation sessions.

## Post-execution sandbox captures (`post_execution_checks`)

Declare file/directory grabs before sandbox teardown (`post_execution.go`):

| `type` | Meaning |
| --- | --- |
| `file_capture` | Persist file bytes up to configured max |
| `directory_listing` | Snapshot structure |

Captured evidence is exposed to graders through `file:<path>` style references downstream—pair with validators that target those artifacts.

Defaults: ~1 MiB per file, aggregate caps enforced per run (`DefaultMaxFileSizeBytes`, `DefaultMaxTotalCaptureBytes`).

## Scorecard (`scorecard`)

`ScorecardDeclaration` holds:

### Dimensions (`dimensions`)

Each dimension may be a plain string shorthand (historical compatibility) **or** an expanded object specifying:

- **`key`** — dimension name (`correctness`, `latency`, `cost`, `behavioral`, custom)
- **`source`** — dispatcher: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, `llm_judge`
- **`validators[]`**, **`metric`**, **`judge_key`** — depending on `source`
- **`weight`**, **`normalization`** — linear normalize against target/max envelopes
- **`gate`**, **`pass_threshold`** — hard fail semantics (see Strategies)

Built-in shortcut keys normalize during `normalizeEvaluationSpec` (`validation.go`): correctness/ reliability/ latency / cost / behavioral auto-fill sensible sources.

### Strategy (`strategy`)

| Strategy | Semantics sketch |
| --- | --- |
| `weighted` | Weighted mean; gated dims may still veto pass verdict |
| `binary` | All dimensions treated as gates; scorecard-level `pass_threshold` is rejected (prevents ambiguity) |
| `hybrid` | Gates AND aggregate over non-gate dims must clear optional `scorecard.pass_threshold` |

See doc comments on `ScoringStrategy` in `spec.go` for nuanced behavior—especially hybrid vs weighted gate interplay.

### `scorecard.pass_threshold`

Optional inclusive overall score cutoff (documented extensively in struct comment). Forbidden for pure `binary`.

### Judge budgets (`scorecard.judge_limits`)

Caps LLM-as-judge spend per run (`MaxSamplesPerJudge`, `MaxCallsUSD`, `MaxTokens`). Hard-coded ceilings (`JudgeMaxSamplesCeiling`) still clamp pack-authored overrides.

### Legacy normalization (`normalization`)

`latency.target_ms`, `cost.target_usd` migrate into dimension-level normalization automatically for older specs—still accepted.

## Runtime limits (`runtime_limits`)

`max_total_tokens`, `max_cost_usd`, `max_duration_ms`—enforced upstream of sandbox/model loops; surfaced for UI + scoring fallbacks.

## Pricing (`pricing.models[]`)

Pricing rows describing per-million token economics for **`run_model_cost_usd`** normalization. Matches `ProviderKey`/`ProviderModelID` tuples your workspace deployments actually use—misaligned rows produce weak cost dims but do not invalidate the pack.

## Evidence references validators understand

Validated by `isSupportedEvidenceReference`:

- Absolute shortcuts: `final_output`, `run.final_output`, `challenge_input`, `case.payload`
- Dotted accessors: `case.payload.*`, `case.inputs.*`, `case.expectations.*`, `artifact.*`
- Sandbox artifacts: prefix `file:` with non-empty remainder
- Literals: `literal:…`

Prefer explicit paths whenever you refactor input schema—ambiguous references fail validate instead of drifting silently.

## See also

- [LLM judges](llm-judges)
- [Write a challenge pack](../guides/write-a-challenge-pack)
- Historical v0 evaluation contract notes live in the monorepo file `docs/evaluation/challenge-pack-v0.md` (developer-oriented, not mirrored on the docs site).

---

# LLM judges

LLM-as-judge declarations, rubric/assertion modes, consensus, budgets, evidence wiring—straight from scoring/spec.go and validation_judges.go.

Source: https://www.agentclash.dev/docs/challenge-packs/llm-judges
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/llm-judges

Agents can be scored by **deterministic validators** alone, pure **LLM judges**, or **`hybrid`** combining both—the tri-state lives in `judge_mode` on `EvaluationSpec` (`backend/internal/scoring/spec.go`).

Judge bodies live in **`evaluation_spec.llm_judges[]`** (`LLMJudgeDeclaration`). Dimensions that consume judges set `source: llm_judge` and **`judge_key`** referencing exactly one declaration (1:1 mapping by design).

## Supported grader modes (`mode`)

| Mode | Typical use |
| --- | --- |
| `rubric` | Structured numeric rubric graded each sample |
| `assertion` | Yes/no factual checks; aggregates via majority/unanimous |
| `n_wise` | Single prompt ranks all competing agents simultaneously |
| `reference` | Rubric calibrated against gold text from resolved evidence |

`IsNumeric` / `IsBooleanScope` helpers govern which consensus math applies.

## Required fields by mode

Validation (`validation_judges.go`) enforces:

- **`rubric`** — non-empty rubric string
- **`reference`** — rubric + `reference_from` evidence reference (must pass `isSupportedEvidenceReference`)
- **`assertion`** — non-empty natural-language assertion; optional `expect` bool flips desired polarity
- **`n_wise`** — non-empty ranking `prompt`; optional `position_debiasing` combats ordering bias across samples

## Model fan-out

Exactly **one** of:

- `model` — single model id string (resolved by worker/provider wiring)
- `models` — non-empty list for multi-model judging

If `len(models) > 1`, you must include **`consensus`** with:

- `aggregation` — `median`, `mean`, `majority_vote`, or `unanimous`
- Optional `min_agreement_threshold`, `flag_on_disagreement`

Boolean-scope modes restrict some aggregations (assertions cannot be mean-averaged nonsensically—validator enforces compatibility).

## Samples & ceilings

- `samples` — per-model repeat count; `0` normalizes to `JudgeDefaultSamples` (3)
- Hard cap `JudgeMaxSamplesCeiling` (10) applied even if the pack requests more—cost attack guard

## Evidence conditioning (`context_from[]`)

Each entry must be a supported evidence reference (same family as validator `target` strings). The workflow evaluator stitches these fragments into the judge envelope before the LLM call.

## Optional controls

| Field | Role |
| --- | --- |
| `output_schema` | JSON Schema for parser validation of model output |
| `score_scale` | `{min,max}` normalization (defaults 1..5 when omitted) |
| `anti_gaming_clauses` | Pack-supplied safety copy **appended** to defaults (never replaces base mitigations) |
| `timeout_ms` | Per-judge activity budget (clamped by outer Temporal activity timeout) |

## Scorecard wiring

1. Declare judges under `llm_judges`.
2. Add a dimension with `source: llm_judge` and `judge_key` matching a judge `key`.
3. Keep **keys unique across validators, metrics, and judges**—collisions are validation errors (namespace collision prevents ambiguous evidence routing).

## Budgets & cost isolation

`scorecard.judge_limits` tracks **judge** spend separately from agent model spend covered by `runtime_limits`. This split is intentional (see Q7 discussion embedded in `JudgeLimits` comments in `spec.go`): agent overages should not hide judge runaway.

When cumulative judge calls exceed configured USD/token budgets, remaining samples downgrade to `unable_to_judge` states feeding scorecard `OutputStateUnavailable` paths.

## Practical authoring tips

- Start with **`rubric`** + single `model` + default samples; add `models`+`consensus` only after deterministic dims stabilise.
- Use **`reference`** when you already store golden answers in `case.expectations` or artifacts—keeps judges aligned to ground truth.
- Assertions excel as **binary gates** (`gate: true` on the dimension) while numeric rubrics express partial credit.

## See also

- [Evaluation spec reference](evaluation-spec-reference)
- [Interpret results](../guides/interpret-results)

---

# Tools, primitives & policy

How tool_policy and tools.custom map to engine primitives in backend/internal/engine and sandbox.ToolPolicy.

Source: https://www.agentclash.dev/docs/challenge-packs/tools-primitives-and-policy
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/tools-primitives-and-policy

AgentClash stacks **three** tool notions (also summarized in [Tools, network, and secrets](../concepts/tools-network-and-secrets)):

1. **Workspace tool resources** — org-level infrastructure objects (not covered by pack YAML)
2. **Pack composed tools** — `tools.custom[]` entries expanding to JSON Schema + implementation
3. **Engine primitives** — concrete executors registered in `nativePrimitiveTools` (`backend/internal/engine/primitive_tools.go`)

Only (2)+(3) are pack-controlled.

## Tool policy shape

`version.tool_policy` JSON eventually hydrates `sandbox.ToolPolicy` (`backend/internal/sandbox/sandbox.go`):

- **`allowed_tool_kinds`** — list controlling capability groups
- **`allow_shell`** — separate bool gating the `exec` primitive

### Recognized kind strings

Validated set (`supportedToolKinds` in `challengepack/validation.go`):

`browser`, `build`, `data`, `file`, `network`

**Shell is not a kind**—enable it with `allow_shell: true`.

### Empty allowlist semantics

`allowsToolKind` treats an **empty** `allowed_tool_kinds` as “allow everything” (per `primitive_helpers.go`). In practice, prefer explicit lists so validation errors catch typos early.

### Mode guardrails

`prompt_eval` packs **must omit** `tool_policy` entirely—see [Bundle YAML reference](bundle-yaml-reference).

## Built-in primitive names

Declared in `executor_builders.go`, registered in `nativePrimitiveTools`:

| Primitive | Gated by |
| --- | --- |
| `submit` | Always available (final answer) |
| `read_file`, `write_file`, `list_files`, `search_files`, `search_text` | `file` kind |
| `query_json`, `query_sql` | `data` kind |
| `http_request` | `network` kind (+ runtime network flags) |
| `run_tests`, `build` | `build` kind |
| `exec` | `allow_shell` |

Browser tooling exists in policy (`toolKindBrowser`)—ensure your template + worker build includes whatever browser bridge your pack expects before relying on it in production.

## Composed tools (`tools.custom[]`)

Each item:

```yaml
tools:
  custom:
    - name: call_support_api
      description: Fetch ticket JSON
      parameters:
        type: object
        properties:
          ticket_id: { type: string }
        required: [ticket_id]
        additionalProperties: false
      implementation:
        primitive: http_request
        args:
          method: GET
          url: https://api.example.com/tickets/${ticket_id}
          headers:
            Authorization: Bearer ${secrets.SUPPORT_TOKEN}
```

Validation highlights (`validateComposedToolConfig`):

- Non-mock tools require **`implementation.primitive`** not equal to the composed name (prevents self-delegation footgun)
- **`implementation.args`** object required; templates validated for placeholder safety
- Parameters must be JSON Schema passing `templateutil.ValidateToolParameterSchema`
- Custom graph cannot contain **cycles** or depth > 8 delegation jumps

### Mock implementations

Set `implementation.type: mock` to skip primitive resolution—useful for dry-run packs or policy-only testing. Mocks bypass cycle detection.

### Workspace tools vs pack tools

Pack tools are **not** the same records as API `tools` resources—they are bundle-local contracts interpreted entirely inside the worker.

## Secret placeholders

Composed `args` may reference `${secrets.NAME}` which resolve through workspace secret stores—**never** place secret material inline. Sandbox `env_vars` explicitly reject secret placeholders (see native executor sandbox guard) because environment leaks are too easy; prefer header injection on `http_request`.

## Provider visibility

`buildToolRegistry` lifts final OpenAI/Anthropic/etc. tool definitions from the registry’s **visible** map—only tools allowed by policy + manifest appear to the model.

## See also

- [Sandbox & E2B](sandbox-and-e2b) for network pairing with `http_request`
- `backend/internal/challengepack/tools_validation_test.go` for edge-case fixtures

---

# Sandbox & E2B

Pack sandbox fields, worker sandbox provider selection, and how native execution reaches E2B today.

Source: https://www.agentclash.dev/docs/challenge-packs/sandbox-and-e2b
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/sandbox-and-e2b

Native runs execute agent tool calls inside an isolated **sandbox provider** implementing `sandbox.Provider` (`backend/internal/sandbox/sandbox.go`). Production commonly uses **E2B** (`backend/internal/sandbox/e2b/provider.go`); local development may run **unconfigured** no-op providers so queues drain without real VMs.

## Pack-level `version.sandbox`

Struct `SandboxConfig` (`challengepack/bundle.go`):

| Field | Purpose |
| --- | --- |
| `network_access` | Boolean gate paired with tool policy network tools |
| `network_allowlist` | CIDR strings; invalid entries fail `ValidateBundle` |
| `env_vars` | Literal environment injection into sandbox (secret placeholders rejected during native executor setup) |
| `additional_packages` | APT package names (`aptPackagePattern` validation) |
| `sandbox_template_id` | Optional override of default E2B template id per pack version |

**Remember:** entire `sandbox` block is illegal for `prompt_eval` packs.

## Worker configuration knobs

From environment / `backend/internal/worker/config.go` (mirror of the searchable [Config reference](../reference/config) tables):

| Variable | Effect |
| --- | --- |
| `SANDBOX_PROVIDER` | `e2b` vs `unconfigured` (noop) |
| `E2B_API_KEY` | Credentials |
| `E2B_TEMPLATE_ID` | Default template when pack omits `sandbox_template_id` |
| `E2B_API_BASE_URL` | Optional API override |
| `E2B_REQUEST_TIMEOUT` | HTTP budget for control-plane calls |

Misconfiguration does not rewrite your YAML—it causes clear worker errors or `/doctor` warnings when local sandboxes cannot start.

## Tool policy vs network flags

Even if `http_request` is allowed by `allowed_tool_kinds: [network]`, outbound traffic still respects:

- global sandbox network toggles
- CIDR allowlists
- provider-level enforcement inside E2B machines

Think of **tool policy** as “model may ask” and **sandbox** as “infrastructure may permit”.

## Secrets & environment

Native executor refuses `${secrets.*}` inside `sandbox.env_vars` because files and process listings could leak them; keep secrets in tool args that go through hardened paths (notably `http_request` header sanitation—see `primitive_secrets.go` comments).

## Failure modes to expect

- **Template drift** — changing `additional_packages` without rebuilding templates can cause first-run apt noise; pin templates when stable.
- **Allowlist too tight** — model receives policy errors from `http_request` if DNS resolves but CIDR blocks egress.
- **No provider** — `SANDBOX_PROVIDER=unconfigured` means native runs **do not** execute real tools; useful for API-only integration tests, misleading if you expect live sandboxes.

## See also

- [Architecture — Sandbox layer](../architecture/sandbox-layer)
- [Tools, primitives & policy](tools-primitives-and-policy)

---

# Input sets & cases

How cases bind challenges, structured inputs, expectations, assets, and legacy payloads—grounded in challengepack.CaseDefinition.

Source: https://www.agentclash.dev/docs/challenge-packs/input-sets-and-cases
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/input-sets-and-cases

Input sets are the unit AgentClash schedules per deployment/candidate. Each `input_sets[]` entry contains **`cases[]`** (`CaseDefinition` in `backend/internal/challengepack/bundle.go`).

## Case identity

- **`challenge_key`** — must reference an existing `challenges[].key`
- **`case_key`** / legacy **`item_key`** — both accepted; normalization duplicates missing side from the other

All cases in one `input_sets[]` entry must reference the same `challenge_key`; split mixed-challenge suites into separate input sets.

`EffectiveKey()` chooses `case_key` when present for stored rows.

## Three authoring styles (coexist)

1. **Legacy payload-only** — fill `payload` map; omit structured inputs/expectations  
2. **Structured eval** — `inputs[]` + `expectations[]` with explicit `kind` fields  
3. **Artifact heavy** — `assets[]` + `artifacts[]` referencing declared version/challenge assets

`IsLegacyPayloadOnly` detects style (1) for storage compatibility.

### Stored document shape

When modern fields exist, `StoredPayload()` marshals `StoredCaseDocument` JSON with `schema_version: 1`, preserving:

- `payload`
- `inputs`
- `expectations`
- `artifacts`
- `assets`

This is what scoring + replay pull back—not the raw YAML fragment.

## Case inputs (`inputs[]`)

`CaseInput` fields:

| Field | Role |
| --- | --- |
| `key` | Stable id for templates / UI |
| `kind` | Drives rendering + validator binding (`text`, `artifact`, etc.—product-specific kinds should match worker expectations) |
| `value` | Inline scalar/object |
| `artifact_key` | Pull bytes from declared asset map |
| `path` | Optional relative path inside asset bundle |

Validators can address values through `case.inputs.<key>` evidence paths.

## Expectations (`expectations[]`)

`CaseExpectation` parallels inputs:

- `key`, `kind`, `value`, `artifact_key`, plus **`source`** telling graders where dynamic gold values originate (`input:prompt` pattern seen in CLI template packs)

Use expectations for:

- deterministic string compares
- supplying LLM judge `reference_from` bindings
- filesystem validators comparing outputs to expected files

## Assets on cases

Case-level `assets[]` references use the same `AssetReference` structure as version-level entries (key, path, optional `artifact_id`). Validation ensures cross-references exist before publish succeeds.

## Input set metadata

Optional `description` on an input set is preserved for UI/discovery; there is no behavioral magic—selection happens by id/key at run creation time.

## Choosing input set at run time

CLI `eval start` accepts `--input-set` when multiple sets exist; otherwise TTY flows prompt. API consumers pass the chosen `input_set_id` when creating runs (see OpenAPI `CreateRun` family).

## See also

- [Bundle YAML reference](bundle-yaml-reference)
- [Evaluation spec — evidence references](evaluation-spec-reference)
- [Artifacts concept](../concepts/artifacts)

---

# Eval workflows & gates

CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in cli/cmd.

Source: https://www.agentclash.dev/docs/challenge-packs/eval-workflows-and-gates
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates

Challenge packs are useless until a **run** binds a `challenge_pack_version_id` to one or more **deployments**. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.

## Happy path commands

From `cli/cmd/eval.go`, `baseline.go`, `compare.go`, `release_gate.go`:

```bash
agentclash eval start --follow
agentclash baseline set [run_id] [--agent <label>]
agentclash eval scorecard [run_id] [--agent <label>] [--json]
agentclash compare runs --baseline <run> --candidate <run>
agentclash compare gate --baseline <run> --candidate <run>
agentclash release-gate list [--baseline ... --candidate ...]
```

### `eval start`

Key flags (see `eval.go` / `eval_resolve.go`):

| Flag | Purpose |
| --- | --- |
| `--pack` | Pack id, slug, or exact name |
| `--pack-version` | Version id or integer |
| `--input-set` | Disambiguates when multiple sets published |
| `--deployment` | Repeatable; accepts id or exact name |
| `--follow` | Stream run events after creation |
| `--scope` | `full` vs `suite_only` regression scoping |
| `--suite` / `--case` | Target regression fixtures |
| `--race-context` | Peer standings injection (multi-agent) |

Non-interactive environments must supply enough disambiguators—resolver errors spell out what’s missing.

### `eval scorecard`

Prints the latest candidate scorecard and, when configured, enriches with **baseline + comparison + release gate** envelopes in one JSON payload for CI (`eval_test.go` asserts those keys).

Omitting `run_id` uses deterministic “latest relevant run” semantics documented in tests—do not assume hidden state in automation; pass explicit ids in CI.

### Baseline bookmarks

`baseline set|show|clear` stores a **workspace-scoped** pointer to a run (and optional specific `run_agent`). This unlocks diff language inside `eval scorecard` without retyping ids.

`doctor` treats missing baseline as **informational** only—CI gates should not fail solely because no baseline exists yet.

## Compare & release gates

- `compare runs` hits comparison APIs with explicit baseline/candidate pair (optional agent ids).  
- `compare gate` posts to `/v1/release-gates/evaluate` with those ids—response includes `policy_snapshot`, `evaluation_details`, timestamps (see `compare.go` long help).
- `release-gate list` surfaces historical evaluations with optional filters.

Gate outcomes use structured status codes (documented in `compare.go` help text) for scripting.

## Relationship to `run create`

`run create` still exists for power users, but product messaging steers new usage to `eval start` (`run.go` long help cross-links). Pick one style per automation story to avoid divergent flag semantics.

## Docs for consumers vs operators

This page is for **people driving hosted or staging workspaces** from the CLI. Self-host operators should still read [Self-host quickstart](../getting-started/self-host) for bringing up Postgres + Temporal + worker parity.

## See also

- [Runs and evals](../concepts/runs-and-evals)
- [Interpret results](../guides/interpret-results)
- [CLI reference](../reference/cli)

---

# Write a Challenge Pack

Author a challenge-pack bundle in YAML, validate it against the current parser, and publish it into a workspace.

Source: https://www.agentclash.dev/docs/guides/write-a-challenge-pack
Markdown export: https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack

Goal: write a pack that the current AgentClash parser, validator, and publish flow will accept.

Prerequisites:

- You have the CLI installed and logged in.
- You linked a workspace with `agentclash link`.
- You know whether the pack should be `prompt_eval` or `native`.

## 1. Start with the CLI scaffold

The fastest way to get a valid starter file is:

```bash
agentclash challenge-pack init support-eval.yaml
```

This generates a minimal bundle that matches the current parser shape. Use `--template native` if you want a native starter instead of the default `prompt_eval`.

## 2. Edit the current minimum shape

This is the smallest honest starting point based on the current bundle parser and tests:

```yaml
pack:
  slug: support-eval
  name: Support Eval
  family: support

version:
  number: 1
  execution_mode: prompt_eval
  evaluation_spec:
    name: support-v1
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: exact
        type: exact_match
        target: final_output
        expected_from: challenge_input
    scorecard:
      dimensions: [correctness]

challenges:
  - key: ticket-1
    title: Ticket One
    category: support
    difficulty: medium
    instructions: |
      Read the request and produce the final answer.

input_sets:
  - key: default
    name: Default Inputs
    cases:
      - challenge_key: ticket-1
        case_key: sample-1
        inputs:
          - key: prompt
            kind: text
            value: hello
        expectations:
          - key: answer
            kind: text
            source: input:prompt
```

This is not a glamorous pack. It is a good pack skeleton because it matches the current parser shape.

Keep every case in a single `input_sets[]` entry pointed at the same `challenge_key`. If a pack covers multiple challenges, split them into separate input sets so each run has one final-output contract to satisfy.

## 3. Add execution policy only when you need it

If the pack is `native`, you can add runtime sections like `tool_policy`, `sandbox`, and `tools`.

Example:

```yaml
version:
  number: 2
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - file
      - shell
      - network
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24
  evaluation_spec:
    name: support-v2
    version_number: 2
    judge_mode: hybrid
    validators:
      - key: exact
        type: exact_match
        target: final_output
        expected_from: challenge_input
    scorecard:
      dimensions: [correctness]

tools:
  custom:
    - name: check_inventory
      description: Check inventory by SKU
      parameters:
        type: object
        properties:
          sku:
            type: string
      implementation:
        primitive: http_request
        args:
          method: GET
          url: https://api.example.com/inventory/${sku}
          headers:
            Authorization: Bearer ${secrets.INVENTORY_API_KEY}
```

Use these sections deliberately.

- `tool_policy` decides what kinds of tools are even available.
- `sandbox.network_access` and `network_allowlist` control outbound networking.
- `tools.custom` defines the tool contract the agent sees.
- `implementation.primitive` picks the executor primitive that actually runs.

## 4. Add assets when inputs should point at files

If the pack needs files, declare them as assets instead of hardcoding mystery paths all over the bundle.

Example:

```yaml
version:
  number: 1
  execution_mode: native
  assets:
    - key: fixtures
      path: fixtures/workspace.zip
      media_type: application/zip
```

You can also back an asset with an uploaded artifact by setting `artifact_id` instead of only relying on a repository path.

Then cases and expectations can refer to those assets by key.

## 5. Validate before you publish

The current CLI command is:

```bash
agentclash challenge-pack validate support-eval.yaml
```

This calls the workspace-scoped validation endpoint and checks the same parser and validation logic the publish path uses.

Typical failures the current code will catch early:

- unknown placeholders like `${missing}`
- invalid CIDR entries in `network_allowlist`
- self-referencing or cyclic composed tools
- invalid tool parameter schemas
- unknown artifact keys or nonexistent stored artifact IDs

## 6. Publish the bundle

Once validation passes:

```bash
agentclash challenge-pack publish support-eval.yaml
```

The publish response returns concrete IDs, including:

- `challenge_pack_id`
- `challenge_pack_version_id`
- `evaluation_spec_id`
- `input_set_ids`
- optional `bundle_artifact_id`

Those IDs matter later because run creation asks for a pack version, not a filename.

## 7. Confirm the workspace can see it

```bash
agentclash challenge-pack list
```

If the pack published cleanly, it should show up in the workspace list with its versions.

## 8. Run it through the workflow-first eval path

```bash
agentclash eval start --follow
agentclash baseline set
agentclash eval scorecard
```

## Verification

You should now have:

- a bundle YAML file the current parser accepts
- a successful `validate` result
- a published pack version ID you can use in run creation
- a workflow path you can reuse without raw IDs for every step

## Troubleshooting

### Validation says a tool placeholder is unknown

Your `implementation.args` template is referencing a variable that is not declared by the tool parameter schema or available template context.

### Validation says a tool references itself or forms a cycle

Your composed tool graph is recursive. Break the cycle and delegate to a primitive or a non-cyclic tool chain.

### Validation says an artifact key is missing

You referenced an asset or artifact key in a case or expectation that was never declared in the pack.

### The pack needs internet access

Do not assume that adding `http_request` is enough. You also need the relevant sandbox/network policy in the pack and runtime path.

## See also

- [Challenge pack reference](../challenge-packs)
- [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs)
- [Tools, Network, and Secrets](../concepts/tools-network-and-secrets)
- [Artifacts](../concepts/artifacts)
- [CLI Reference](../reference/cli)

---

# Configure Runtime Resources

Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the current product expects.

Source: https://www.agentclash.dev/docs/guides/configure-runtime-resources
Markdown export: https://www.agentclash.dev/docs-md/guides/configure-runtime-resources

Goal: assemble the resource chain that turns a ready build version into a runnable deployment.

Prerequisites:

- You already selected a workspace.
- You already have an `agent_build_id` and a ready `build_version_id`.
- You have the provider credential you intend to use.

## 1. Store provider credentials as workspace secrets

If you already want explicit secret management, set the secret first:

```bash
printf '%s' "$OPENAI_API_KEY" | agentclash secret set OPENAI_API_KEY
agentclash secret list
```

The list endpoint returns metadata only. Secret values are not exposed back to you.

## 2. Inspect the model catalog

Model aliases point at model catalog entries, so fetch the catalog first:

```bash
agentclash infra model-catalog list
agentclash infra model-catalog get <MODEL_CATALOG_ENTRY_ID>
```

This gives you the model entry ID you will use when creating the alias.

## 3. Create a provider account

You have two current patterns.

### Pattern A: reference an existing workspace secret

`provider-account.json`:

```json
{
  "provider_key": "openai",
  "name": "OpenAI Workspace Account",
  "credential_reference": "workspace-secret://OPENAI_API_KEY",
  "limits_config": {
    "rpm": 60
  }
}
```

Create it:

```bash
agentclash infra provider-account create --from-file provider-account.json
```

### Pattern B: pass `api_key` directly on creation

```json
{
  "provider_key": "openai",
  "name": "OpenAI Workspace Account",
  "api_key": "<PASTE_KEY_HERE>"
}
```

The current infrastructure manager does not keep that raw value on the account row. It stores the key as a workspace secret and rewrites the provider account to use a `workspace-secret://...` credential reference automatically.

## 4. Create a runtime profile

A runtime profile controls execution target and limits.

`runtime-profile.json`:

```json
{
  "name": "default-native",
  "execution_target": "native",
  "trace_mode": "full",
  "max_iterations": 24,
  "max_tool_calls": 32,
  "step_timeout_seconds": 120,
  "run_timeout_seconds": 1800,
  "profile_config": {
    "sandbox": {
      "allow_shell": true,
      "allow_network": false
    }
  }
}
```

Create it:

```bash
agentclash infra runtime-profile create --from-file runtime-profile.json
```

## 5. Create a model alias

A model alias gives the workspace a stable handle for one model catalog entry.

`model-alias.json`:

```json
{
  "alias_key": "primary-chat",
  "display_name": "Primary Chat Model",
  "model_catalog_entry_id": "<MODEL_CATALOG_ENTRY_ID>",
  "provider_account_id": "<PROVIDER_ACCOUNT_ID>"
}
```

Create it:

```bash
agentclash infra model-alias create --from-file model-alias.json
```

Use aliases when you want deployment configuration and playgrounds to refer to a stable workspace label instead of a raw provider model identifier.

## 6. Create the deployment

The current deployment create contract requires:

- `name`
- `agent_build_id`
- `build_version_id`
- `runtime_profile_id`

Optional but commonly useful:

- `provider_account_id`
- `model_alias_id`

Fast path with flags:

```bash
agentclash deployment create \
  --name support-bot-prod \
  --agent-build-id <AGENT_BUILD_ID> \
  --build-version-id <BUILD_VERSION_ID> \
  --runtime-profile-id <RUNTIME_PROFILE_ID> \
  --provider-account-id <PROVIDER_ACCOUNT_ID> \
  --model-alias-id <MODEL_ALIAS_ID>
```

JSON-file path if you want the full request shape:

```json
{
  "name": "support-bot-prod",
  "agent_build_id": "<AGENT_BUILD_ID>",
  "build_version_id": "<BUILD_VERSION_ID>",
  "runtime_profile_id": "<RUNTIME_PROFILE_ID>",
  "provider_account_id": "<PROVIDER_ACCOUNT_ID>",
  "model_alias_id": "<MODEL_ALIAS_ID>",
  "deployment_config": {}
}
```

Then:

```bash
agentclash deployment create --from-file deployment.json
```

## 7. List what you created

```bash
agentclash infra runtime-profile list
agentclash infra provider-account list
agentclash infra model-alias list
agentclash deployment list
```

At that point the workspace has a real runnable target the run-creation flow can select.

## Where tools fit

Workspace tools are their own infra resource surface:

```bash
agentclash infra tool list
agentclash infra tool create --from-file tool.json
```

That is separate from pack-defined composed tools. Do not mix those up in your mental model.

## Verification

You should now have:

- a workspace secret for provider credentials
- a provider account that resolves credentials indirectly
- a runtime profile defining execution limits
- a model alias pointing at a model catalog entry
- a deployment that can be selected during run creation

## Troubleshooting

### Deployment creation fails because the build version is not deployable

The current API requires a ready build version. Mark the build version ready before deploying it.

### I do not know which model alias to create

Start from `agentclash infra model-catalog list`, then create the alias only after you know which catalog entry and provider account you want to bind.

### I passed an API key directly and now cannot see it again

That is expected. Raw provider keys are stored as workspace secrets and the account keeps only a credential reference.

## See also

- [Agents and Deployments](../concepts/agents-and-deployments)
- [Tools, Network, and Secrets](../concepts/tools-network-and-secrets)
- [Config Reference](../reference/config)
- [CLI Reference](../reference/cli)

---

# Interpret Results

Read AgentClash run output from top-line score to raw replay evidence without getting lost.

Source: https://www.agentclash.dev/docs/guides/interpret-results
Markdown export: https://www.agentclash.dev/docs-md/guides/interpret-results

Goal: turn a finished run or eval into a decision you can defend.

Prerequisites:

- You have a run or eval to inspect.
- You understand the basic difference between a run and an eval.
- You know which challenge pack or workload the result came from.

## Start with the top-line state

Before you read the full timeline, answer three simple questions:

1. Did the run complete, fail, or time out?
2. Which deployment produced the result?
3. Which challenge pack or input set was this run judged against?

If you skip this step, you will mix together infrastructure problems, workload problems, and actual agent regressions.

## Read the score before the trace

The scorecard or summary view is the fastest way to orient yourself. Use it to identify:

- the overall outcome
- the dimension that changed since the last comparable run
- any obvious outlier input or scenario
- whether the run generated enough evidence to trust the outcome

> Info: A score change is only actionable when the underlying workload is comparable.
> Always confirm you are looking at the same deployment class and challenge pack.

## Use the replay timeline to explain the result

Once you know what changed, move to the replay or event timeline.

A useful reading order is:

1. find the first non-trivial event after run start
2. follow tool calls or sandbox transitions in order
3. locate the first irreversible failure or divergence
4. inspect any terminal event that explains why scoring ended where it did

You are looking for the earliest point where the run stopped being healthy. That might be an agent reasoning mistake, but it might just as easily be a sandbox issue, a bad callback, or a missing artifact.

## Separate agent failures from platform failures

This distinction matters for every comparison review.

Treat these as different buckets:

- agent failure: the deployment ran, but the behavior was wrong or weak
- scenario failure: the workload or scoring context exposed a gap or ambiguity
- platform failure: orchestration, sandbox, callback, artifact, or infrastructure issues broke the run

Only the first bucket should drive model or prompt claims directly.

## Compare runs only after you trust each side

When you compare two runs, make sure both have:

- the same or intentionally different deployment target
- the same workload definition
- enough replay evidence to justify the score
- no obvious infrastructure corruption hiding behind the final state

If one side is missing replay evidence or has an incomplete artifact trail, the comparison is weak even if the ranking UI still renders.

## What to do with a useful failure

A good failure is not just something to fix. It is something to preserve.

When a run reveals a real gap:

1. capture the replay and artifacts that make the issue obvious
2. tie the failure back to the scenario or input that exposed it
3. promote it into a repeatable challenge-pack case when the product surface supports it
4. rerun after the fix so the score change is evidence-backed, not anecdotal

That is the core loop behind serious evaluation work.

## Verification

You should now be able to look at one run and answer:

- what failed
- where it failed first
- whether the failure belongs to the agent, workload, or platform
- what evidence you would preserve for future regression testing

## Troubleshooting

### The score changed, but I cannot explain why

Open the replay view and find the first event where the run diverged from the baseline. If you cannot find one, the run may be missing evidence or you may be comparing different workloads.

### The run failed before any meaningful agent work happened

Treat that as a platform or setup issue first. Check orchestration, sandbox configuration, callbacks, and artifacts before concluding the deployment regressed.

### Two runs disagree, but both look messy

Do not force a ranking conclusion. Clean the workload, rerun under the same conditions, and compare again.

## See also

- [Replay and Scorecards](../concepts/replay-and-scorecards)
- [Runs and Evals](../concepts/runs-and-evals)
- [Evidence Loop](../architecture/evidence-loop)
- [First Eval](../getting-started/first-eval)

---

# CI/CD Agent Gates

Use a repo-tracked AgentClash CI manifest to define which agent revision, workload, baseline, and gate a pull request should run.

Source: https://www.agentclash.dev/docs/guides/ci-cd-agent-gates
Markdown export: https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates

AgentClash CI should gate an agent revision, not only a prompt diff.

Prompt-focused tools can usually watch `prompts/**` and rerun a prompt eval. AgentClash's main product model is richer: an agent change can touch instructions, workflow code, tool bindings, model aliases, runtime limits, output schemas, guardrails, or retrieval configuration. The CI contract therefore needs to name the candidate agent build, deployment settings, challenge workload, baseline, and gate policy explicitly.

## The manifest is the contract

Create a repo-tracked manifest:

```bash
agentclash ci init .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml --remote --json
agentclash ci baseline --manifest .agentclash/ci.yaml --json
agentclash ci should-run --changed-file prompts/system.md --json
```

The generated manifest has this shape:

```yaml
version: 1
trigger:
  paths:
    - .agentclash/agent.json
    - prompts/**
    - tools/**
  labels:
    - agentclash/eval
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
  deployment:
    name: pr-candidate
    runtime_profile_id: 00000000-0000-0000-0000-000000000002
    provider_account_id: 00000000-0000-0000-0000-000000000003
    model_alias_id: 00000000-0000-0000-0000-000000000004
evaluation:
  challenge_pack_version_id: 00000000-0000-0000-0000-000000000005
  input_set_id: 00000000-0000-0000-0000-000000000006
  regression_suites:
    - 00000000-0000-0000-0000-000000000007
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
gate:
  fail_on: regression
regressions:
  promote_failures: proposed
```

The IDs in the generated file are placeholders. Replace them with workspace resources before using the manifest for a real gate.

Local validation is always offline. Add `--remote` when you want the CLI to call the AgentClash API and verify that the manifest's agent build, runtime profile, provider account, model alias, challenge pack version, input set, regression suites or cases, and baseline are visible from the selected workspace. Because this makes real authenticated API calls, set `AGENTCLASH_API_URL`, `AGENTCLASH_TOKEN`, and `AGENTCLASH_WORKSPACE` in CI and expect normal API latency, rate limits, and token scoping rules. JSON output includes a `remote.checks[]` entry per referenced field, so CI can report whether a failure came from the local manifest contract or from an API/resource check.

## What each section means

- `trigger` says which repository paths and optional labels should cause the workflow to run.
- `candidate.build` names the existing AgentClash build and the source-backed build-version spec to test.
- `candidate.deployment` names the runtime resources used for the candidate deployment.
- `evaluation` names the workload: challenge pack version, optional input set, and optional regression suites or cases.
- `baseline` names the locked reference run or deployment, plus explicit refresh and staleness rules.
- `gate` names the release-gate failure threshold.
- `regressions` controls whether failed cases should only be reported, proposed for promotion, or eventually auto-promoted on main.

The important distinction is:

```text
agent build/deployment = thing under test
challenge pack/regression suite = workload used to test it
release gate = decision policy
```

If you are deciding what the workload should contain, use [CI/CD Workload Recipes](https://www.agentclash.dev/docs-md/guides/ci-cd-workload-recipes) for coding, research, support/ops, and long-horizon agent patterns.

## Baseline strategy and refresh

For pull request gates, prefer `baseline.run_id`. It pins the exact accepted mainline run, so every reviewer can see what changed when the baseline moves. Add `baseline.run_agent_id` only when the locked run has multiple participants and the gate must compare against one specific agent lane.

Use `baseline.deployment_id` only when the team intentionally wants a moving selector. `agentclash ci baseline` resolves it to the newest completed run in the workspace that matches the manifest workload and includes that deployment. The command prints the exact resolved `run_id` and `run_agent_id` so downstream automation still compares against concrete IDs.

Use `baseline.max_age_days` when a stale baseline should block the gate. The resolver checks the chosen run's `finished_at` or `created_at` timestamp and fails instead of silently comparing against old behavior.

Refreshes are explicit:

```yaml
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
```

- `manual`: after a successful mainline run, update `baseline.run_id` in a reviewed change.
- `propose`: automation may propose the new baseline, but a human still reviews the manifest change.
- `auto_on_main`: a protected mainline workflow may update the manifest with an auditable commit after the gate passes.

Resolve the baseline before running a gate:

```bash
agentclash ci baseline \
  --manifest .agentclash/ci.yaml \
  --json
```

The JSON includes `strategy`, `source`, `baseline.run_id`, optional `baseline.run_agent_id`, `refresh.mode`, and `refresh.next_action`.

## Decide whether CI should run

Use `agentclash ci should-run` when you want AgentClash to explain whether a pull request touches the agent contract. A matching path or label produces `should_run: true`; unrelated docs-only changes produce `should_run: false`.

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file prompts/system.md
```

Labels can force the gate even when paths do not match:

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --changed-file docs/readme.md \
  --labels agentclash/eval \
  --json
```

In GitHub Actions, `ci should-run` reads pull request labels from `GITHUB_EVENT_PATH` automatically when `--labels` is omitted, and the bundled action passes that behavior through. Use `--github-event <path>` only when testing a saved event payload locally.

For local or GitHub Actions diffing, pass refs explicitly:

```bash
agentclash ci should-run \
  --manifest .agentclash/ci.yaml \
  --base origin/main \
  --head HEAD \
  --json
```

## GitHub Actions sketch

The manifest is the single source of truth for the candidate revision, workload, baseline, and gate. A pull request workflow can validate it, decide whether it should run for the changed files, then let `agentclash ci run` create the candidate build version, deployment, run, and release-gate evaluation.

Use the reusable AgentClash action when you want the standard GitHub integration without rewriting the shell glue:

```yaml
name: AgentClash gate

on:
  pull_request:
    paths:
      - ".agentclash/**"
      - "prompts/**"
      - "tools/**"

jobs:
  agentclash:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-node@v4
        with:
          node-version: "22"

      - name: Run AgentClash CI gate
        id: agentclash
        uses: agentclash/agentclash/.github/actions/agentclash-ci@main
        with:
          token: ${{ secrets.AGENTCLASH_TOKEN }}
          workspace: ${{ secrets.AGENTCLASH_WORKSPACE }}
          manifest: .agentclash/ci.yaml

      - name: Upload AgentClash gate artifacts
        if: always() && steps.agentclash.outputs['should-run'] == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: agentclash-ci
          path: |
            ${{ steps.agentclash.outputs.result-file }}
            ${{ steps.agentclash.outputs.artifact-dir }}/*.json
```

The action installs the published `agentclash` npm package by default, runs `ci validate --remote`, runs `ci should-run`, auto-detects pull request labels from the GitHub event payload, skips unrelated changes, runs `ci run` when matched, posts or updates a sticky structured PR comment when pull request context is available, and exposes `should-run`, `skip-reason`, `run-id`, `gate-verdict`, `exit-code`, `result-file`, and `artifact-dir` outputs. It preserves the CLI exit code, so a blocking gate fails the workflow normally. Grant `pull-requests: write` when you want GitHub-hosted PR comments; commenting is best-effort and permission failures do not override the AgentClash result. When run metadata is available, the comment links reviewers directly to the AgentClash candidate run, baseline run, comparison, failures, scorecard, replay, and regression cases. If setup fails before a candidate run is created, the sticky comment reports the errored setup state and points reviewers at the GitHub Actions log.

`agentclash ci run` exits nonzero when the gate verdict should block CI, when the candidate run times out, or when the manifest/API setup is invalid. In GitHub Actions, it automatically attaches repository, pull request, branch, default branch, commit, workflow, event, and workflow-run URL metadata to the AgentClash run. It also appends a reviewer-friendly Markdown section when the `$GITHUB_STEP_SUMMARY` environment variable is set, while the bundled action turns the same run evidence into the PR comment. Pass `--summary-file <path>` for another Markdown destination, or `--github-step-summary=false` to disable the automatic GitHub summary.

`--artifact-dir` writes stable JSON files intended for `actions/upload-artifact`: `result.json` for the final CLI envelope, `run.json` for run creation/completion payloads, `scorecard.json` for candidate scorecard evidence, `comparison.json` for baseline/candidate comparison evidence, and `gate.json` for the release-gate verdict and policy metadata. The summary and artifacts include the challenge pack version, baseline, candidate, policy, verdict, top evidence lines, regression candidate promotion outcomes, and AgentClash links when the API returns them. Use `--ci-repository`, `--ci-pull-request`, `--ci-branch`, `--ci-default-branch`, `--ci-commit`, and the other `--ci-*` flags when running from another CI system or a custom wrapper.

## Regression promotion policy

Do not auto-promote every PR failure by default. A bad run, flaky dependency, or weak evaluator could pollute the regression suite. Use this conservative progression:

```yaml
regressions:
  promote_failures: disabled
```

Report failures only. When the gate fails, the CLI records that promotion was skipped and does not call failure-listing or promotion endpoints.

```yaml
regressions:
  promote_failures: proposed
```

Create reviewable candidates after a failing gate. The CLI lists the candidate run's failure-review items, checks each target suite from `evaluation.regression_suites`, skips any challenge identity that already has a non-archived/non-rejected case, then calls the promote-failure API with `status: proposed`.

Proposed cases appear in the regression suite UI without entering future runs. A reviewer can accept them by changing status to `active`, or reject/archive them if the failure is noisy, duplicated, or not worth keeping.

```yaml
regressions:
  promote_failures: auto_on_main
```

Create active cases only from protected default-branch runs. The CLI refuses pull request events, `refs/pull/*`, missing default branch metadata, and non-default branches. GitHub Actions usually supplies the default branch through the event payload; custom CI wrappers should pass `--ci-default-branch main`.

All modes preserve the original gate exit code. Promotion errors are reported in the human output, JSON `regression_promotions.errors`, GitHub step summary, and artifact `result.json`, but a blocking regression still exits with the gate failure code.

## Current limits

- `agentclash ci validate` validates the manifest shape locally; pass `--remote` for API-backed resource checks.
- `agentclash ci should-run` only decides whether a gate should run; `agentclash ci run` performs the orchestration.
- `agentclash ci run` creates a one-off candidate deployment for the manifest build version; cleanup/retention policy is still a follow-up.
- GitHub Check Runs with rich annotations are still follow-up work; use the sticky PR comment, GitHub step summary, and uploaded JSON artifacts today.
- Regression candidate promotion requires `evaluation.regression_suites`; without at least one target suite, `ci run` reports promotion as blocked.

## See also

- [Agents and Deployments](https://www.agentclash.dev/docs-md/concepts/agents-and-deployments)
- [Challenge Packs and Inputs](https://www.agentclash.dev/docs-md/concepts/challenge-packs-and-inputs)
- [Eval Workflows and Gates](https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates)
- [CI/CD Workload Recipes](https://www.agentclash.dev/docs-md/guides/ci-cd-workload-recipes)

---

# CI/CD Workload Recipes

Choose realistic AgentClash CI workloads for coding, research, support, ops, and long-horizon agents.

Source: https://www.agentclash.dev/docs/guides/ci-cd-workload-recipes
Markdown export: https://www.agentclash.dev/docs-md/guides/ci-cd-workload-recipes

AgentClash CI should answer one question for a pull request: did this agent revision remain good enough against the workload that matters?

The agent is the thing under test. In a manifest, that means the candidate build, deployment, runtime resources, model alias, provider account, tools, schemas, policies, prompts, and workflow code. The challenge pack or regression suite is the workload used to test it. The release gate is the decision policy that turns the run into pass, warn, or fail.

Use this page after you have read the [CI/CD Agent Gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates) manifest guide. It focuses on what teams should actually evaluate when their agents are more complex than a single prompt.

## Broad Packs And Regression Suites

Use a broad challenge pack when you need confidence that the agent still performs its main job across representative scenarios. Broad packs should cover normal work, edge cases, tool use, policy boundaries, and output quality.

Use a narrow regression suite when you need to lock failures that have already happened. Regression suites should be small, high-signal, and cheap enough to run often. They are strongest when they include failure evidence, a stable expected behavior, and a clear reason the case matters.

Most CI setups need both:

- Pull requests run a smoke challenge pack plus critical regressions.
- Nightly or mainline workflows run the full challenge pack plus larger regression suites.
- Expensive long-horizon tasks run on labels, scheduled workflows, or release branches.

When `regressions.promote_failures: proposed` is enabled, failing gates can add reviewable candidates to the suites listed in `evaluation.regression_suites`. Treat those proposals as a queue, not as automatic truth: accept cases that represent durable agent risk, reject noisy or duplicate failures, and keep the suite small enough that engineers still trust the gate.

## Coding Agent Recipe

Coding agents change when prompts, repository skills, tool policy, model aliases, sandbox settings, or patch-generation code changes. A useful CI workload should exercise the agent's ability to inspect a repo, edit files, run tests, and stop safely.

| Decision | Recommended shape |
| --- | --- |
| Watched paths | `.agentclash/agent.json`, `.agentclash/ci.yaml`, `prompts/**`, `tools/**`, `skills/**`, `evals/coding/**`, sandbox templates, patch validators, model alias config |
| Candidate resources | Candidate agent build spec, deployment, runtime profile with shell access, provider account, model alias, repository fixture artifact, tool policy |
| Workload type | Small deterministic coding tasks in a challenge pack, plus regression cases for previously broken diffs |
| Baseline strategy | Lock a known-good mainline run for the same repo fixture and update it only after a green mainline run |
| Gate policy | Fail on correctness regression, invalid patch, missing tests, unsafe command use, timeout, or material score drop; warn on latency/cost drift |

Start with tasks that have objective validation:

- update a small API and make unit tests pass
- fix a failing parser with a constrained fixture
- modify a config file and preserve unrelated formatting
- refuse a task that asks for destructive commands outside policy

The pack should capture the repository fixture, expected patch behavior, allowed commands, timeout, and validation command. Regression cases should come from concrete failures such as editing the wrong file, skipping tests, breaking an unrelated API, or looping after a failed command.

## Research Agent Recipe

Research agents change when retrieval settings, browsing tools, citation prompts, source filters, answer schemas, or model aliases change. The workload should measure whether the agent can triangulate sources, handle contradictions, and cite evidence without overclaiming.

| Decision | Recommended shape |
| --- | --- |
| Watched paths | `prompts/research/**`, `retrieval/**`, `tools/search/**`, `schemas/research-output.json`, source allowlists, model alias config |
| Candidate resources | Candidate deployment, retrieval profile, browser/search tool policy, provider account, model alias, output schema |
| Workload type | Challenge pack with time-bounded research questions and expected evidence properties; regression suite for past hallucinations or citation failures |
| Baseline strategy | Compare against a stable deployment run on the same source snapshot or same controlled source corpus |
| Gate policy | Fail on unsupported claims, missing citations, fabricated citations, ignored contradictions, schema violations, or unsafe source use |

Good tasks ask for decisions that require evidence:

- compare two vendors and cite primary documentation
- summarize a policy change while distinguishing effective dates
- answer a question with conflicting sources and explain uncertainty
- refuse to infer facts not present in the allowed source set

For CI, prefer source snapshots or controlled fixtures when possible. Live web tasks are useful in nightly runs, but PR gates should avoid failures caused by normal web drift unless the agent's job is specifically live research.

## Support And Ops Agent Recipe

Support and ops agents change when escalation rules, tool bindings, PII policy, ticket schemas, incident workflows, or account-access rules change. The workload should verify tool-call correctness and policy adherence before conversational style.

| Decision | Recommended shape |
| --- | --- |
| Watched paths | `prompts/support/**`, `policies/**`, `tools/crm/**`, `tools/ticketing/**`, `schemas/ticket*.json`, escalation rules, model alias config |
| Candidate resources | Candidate deployment, mocked CRM/ticketing tools, secret references, runtime profile without broad network, output schema |
| Workload type | Challenge pack with mocked tool calls, structured outputs, escalation scenarios, and safety/PII cases |
| Baseline strategy | Lock the current production deployment or last accepted mainline run for the same tool mock version |
| Gate policy | Fail on wrong tool arguments, unauthorized action, missed escalation, PII leak, schema violation, or unsafe automation; warn on tone/style deltas |

Useful cases include:

- refund request that must call the correct account lookup before action
- angry customer that needs escalation rather than policy invention
- incident triage that must create a ticket with the right severity and owner
- request containing PII that must be redacted in summaries and logs

Keep the tool layer mocked and deterministic in PR CI. Production-like integrations are better for staging or scheduled validation, where external service noise will not block every pull request.

## Long-Horizon Agent Recipe

Long-horizon agents are expensive and nondeterministic enough that one run is rarely enough. Their CI should be tiered: fast smoke checks on every relevant pull request, broader repeated runs on main or release branches, and deep suites on schedule.

| Decision | Recommended shape |
| --- | --- |
| Watched paths | Agent orchestration code, planning prompts, tool policy, memory/retrieval config, runtime limits, model aliases, environment templates |
| Candidate resources | Candidate deployment, runtime profile with explicit timeout/tool-call limits, model alias, workload artifacts, optional regression suites |
| Workload type | Smoke challenge pack for PRs; full challenge pack plus high-severity regressions for main; repeated long tasks for scheduled runs |
| Baseline strategy | Lock a baseline deployment or run series, not a single lucky pass; refresh after a successful mainline batch |
| Gate policy | Fail PR smoke on deterministic blockers; use pass-rate, repeated-run, or confidence thresholds for longer suites; warn when evidence is insufficient |

For long-horizon agents, track both optimistic and pessimistic reliability:

- `pass@k`: at least one attempt succeeds across `k` tries
- `pass^k`: every attempt succeeds across `k` tries

PR gates should usually run a small number of deterministic smoke tasks. Nightly gates can run repeated trials, larger fixtures, and statistical checks. A failure should become a regression candidate only when it reproduces often enough to be signal instead of noise.

## Choosing The First CI Workload

Start smaller than the final eval strategy:

1. Pick one agent deployment that matters.
2. Lock one baseline run from a known-good mainline revision.
3. Choose a smoke challenge pack with 3 to 10 high-signal cases.
4. Add only the top production or staging regressions.
5. Fail on clear correctness or policy regressions; warn on cost and latency until the gate earns trust.

Then expand:

- Add broad coverage when the smoke gate is stable.
- Promote repeated failures into regression suites after review.
- Split PR, mainline, nightly, and release workloads by cost and confidence.
- Refresh baselines explicitly, never as a side effect of an arbitrary PR.

Use `auto_on_main` only after this review loop is boring. It creates active regression cases from default-branch failures, while pull requests should normally stay on `proposed` so reviewers can decide whether a failure is a real product regression or just an unstable eval.

## Manifest Example

This manifest watches the coding-agent surface, runs a smoke pack plus critical regressions, and compares the candidate against a locked baseline run:

```yaml
version: 1
trigger:
  paths:
    - .agentclash/agent.json
    - .agentclash/ci.yaml
    - prompts/coding/**
    - tools/repo/**
    - skills/coding/**
  labels:
    - agentclash/eval
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
  deployment:
    name: pr-coding-agent
    runtime_profile_id: 00000000-0000-0000-0000-000000000002
    provider_account_id: 00000000-0000-0000-0000-000000000003
    model_alias_id: 00000000-0000-0000-0000-000000000004
evaluation:
  challenge_pack_version_id: 00000000-0000-0000-0000-000000000005
  input_set_id: 00000000-0000-0000-0000-000000000006
  regression_suites:
    - 00000000-0000-0000-0000-000000000007
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
gate:
  fail_on: regression
regressions:
  promote_failures: proposed
```

The exact IDs should come from your AgentClash workspace. The process should be explicit: update the candidate when the agent changes, update the workload when the eval strategy changes, and update the baseline only after the new mainline behavior is accepted. Use `agentclash ci baseline --manifest .agentclash/ci.yaml --json` in CI to print the exact baseline run used and why.

## See Also

- [CI/CD Agent Gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates)
- [Write a Challenge Pack](https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack)
- [Challenge Pack Documentation](https://www.agentclash.dev/docs-md/challenge-packs)
- [Eval Workflows and Gates](https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates)
- [Agents and Deployments](https://www.agentclash.dev/docs-md/concepts/agents-and-deployments)

---

# Use with AI Tools

Feed AgentClash docs into ChatGPT, Codex, Claude Code, and similar tools using llms.txt and markdown exports.

Source: https://www.agentclash.dev/docs/guides/use-with-ai-tools
Markdown export: https://www.agentclash.dev/docs-md/guides/use-with-ai-tools

Goal: give an assistant or coding agent enough AgentClash context to answer questions, draft workflows, or transform the docs into internal runbooks.

Prerequisites:

- You can open or paste URLs into the assistant you are using.
- You know whether you want the full docs bundle or just one page.

## Pick the right docs export

AgentClash now exposes three AI-friendly surfaces:

- `/llms.txt`: a compact index of the shipped docs set
- `/llms-full.txt`: one bundled markdown-oriented export of the full docs corpus
- `/docs-md/...`: page-level markdown exports that mirror `/docs/...`

Use them differently:

- use `llms.txt` when the tool needs a map first
- use `llms-full.txt` when you want one-shot context for a larger prompt
- use `/docs-md/...` when you only need one focused page, like quickstart or config

## Fastest workflow in ChatGPT, Codex, or Claude Code

1. Start with `https://agentclash.dev/llms.txt`.
2. If the tool can fetch URLs directly, give it that URL first.
3. If the tool cannot fetch URLs, open the file yourself and paste the contents.
4. Ask the tool which page it needs next.
5. Feed the relevant `/docs-md/...` page or the full bundle, depending on scope.

That keeps the context tight. Do not dump the full bundle into every prompt by default.

## Good prompt patterns

Use prompts that ask the model to stay anchored to the supplied docs. Examples:

```text
Using https://agentclash.dev/llms.txt, tell me which docs pages I should read to self-host AgentClash and understand the worker architecture.
```

```text
Use the markdown from https://agentclash.dev/docs-md/guides/interpret-results and turn it into a short incident-review checklist for my eval team.
```

```text
Use https://agentclash.dev/llms-full.txt as the product docs corpus and answer only from that material: what is the difference between a run, an eval, and a challenge pack?
```

## When to use page-level exports instead of the full bundle

Prefer page-level exports when:

- you are debugging one subsystem
- you want tighter answers with lower token cost
- the assistant tends to over-generalize when given too much context

Prefer the full bundle when:

- you want a holistic onboarding summary
- you are asking for a docs-wide rewrite or glossary
- you are building internal retrieval or indexing pipelines

## Verification

You should now be able to hand any of these URLs to a tool and get grounded answers:

- `https://agentclash.dev/llms.txt`
- `https://agentclash.dev/llms-full.txt`
- `https://agentclash.dev/docs-md/getting-started/quickstart`

## Troubleshooting

### The assistant cannot open URLs

Open the relevant endpoint yourself and paste the content directly.

### The answer is too vague

Use a narrower `/docs-md/...` page instead of the full bundle.

### The answer mixes product claims with guesses

Tell the tool to answer only from the supplied docs export and cite the page title it used.

## See also

- [Quickstart](../getting-started/quickstart)
- [Config Reference](../reference/config)
- [Codebase Tour](../contributing/codebase-tour)

---

# Agent Skills

Copyable AgentClash skills for coding agents, exposed as docs pages and markdown exports.

Source: https://www.agentclash.dev/docs/agent-skills
Markdown export: https://www.agentclash.dev/docs-md/agent-skills

AgentClash ships portable Agent Skills for coding agents that understand the `SKILL.md` folder format. The canonical source lives in `web/content/agent-skills/.../SKILL.md`; docs pages and markdown exports are generated from that source.

## Install Targets

- Codex: copy a skill folder into `.agents/skills/<skill>/SKILL.md` or point Codex at the markdown export.
- Claude Code: copy a skill folder into `.claude/skills/<skill>/SKILL.md`; if the repo already uses `AGENTS.md`, add a `CLAUDE.md` import for `@AGENTS.md`.
- Cursor: use these pages as agent-requested rule references, or add thin `.cursor/rules/*.mdc` stubs that link to the matching markdown export.
- Generic agents: fetch `/llms.txt`, `/llms-full.txt`, or the individual `/docs-md/agent-skills/<skill>` pages.

## Core Operating Skills

- [agentclash-ci-release-gate](https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate) - ci: Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.
- [agentclash-cli-setup](https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup) - setup: Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.
- [agentclash-eval-runner](https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner) - running: Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
- [agentclash-regression-flywheel](https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel) - regression: Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.
- [agentclash-scorecard-reader](https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader) - reviewing: Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.

## Challenge Pack Skills

Focused skills for planning, authoring, scoring, judging, tooling, artifacts, validation, and publishing challenge packs.

- [agentclash-challenge-pack-artifacts](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts) - challenge-pack-artifacts: Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.
- [agentclash-challenge-pack-input-sets](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets) - challenge-pack-inputs: Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
- [agentclash-challenge-pack-llm-judges](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges) - challenge-pack-judging: Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
- [agentclash-challenge-pack-planner](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner) - challenge-pack-planning: Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
- [agentclash-challenge-pack-scoring-validators](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators) - challenge-pack-scoring: Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
- [agentclash-challenge-pack-tools-sandbox](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox) - challenge-pack-tools: Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.
- [agentclash-challenge-pack-validation-publish](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish) - challenge-pack-publication: Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.
- [agentclash-challenge-pack-yaml-author](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author) - challenge-pack-authoring: Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.

## Agent Build Skills

Focused skills for agent build specs, deployments, runtime resources, provider accounts, model aliases, and secrets.

- [agentclash-agent-build-author](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author) - agent-builds: Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.
- [agentclash-agent-deployment-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup) - agent-deployments: Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.
- [agentclash-runtime-resources-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup) - runtime-resources: Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.

## Canonical Layout

```text
web/content/agent-skills/<category-or-skill>/.../SKILL.md
```

Each skill keeps the main instructions focused and uses trigger-oriented frontmatter so agents can discover the right workflow before loading the full body.

## Catalog Contract

The root catalog skill is the authoring and review contract for all AgentClash skills.

````markdown
---
name: agentclash-skill-catalog
description: Use when creating, reviewing, or updating AgentClash agent-skill folders so the catalog taxonomy, frontmatter, generated docs, markdown exports, and llms.txt surfaces stay consistent.
metadata:
  agentclash.role: catalog
  agentclash.version: "1"
  agentclash.requires_cli: "false"
---

# AgentClash Skill Catalog

## Purpose
Define the folder taxonomy and publishing contract for AgentClash Agent Skills. Use this skill before adding or changing any `web/content/agent-skills/**/SKILL.md` file.

## Use When
- A user asks to add a new AgentClash skill.
- A user asks to update the skill catalog, taxonomy, metadata, generated docs, markdown exports, or `llms.txt` coverage.
- A reviewer needs to check whether a skill can be copied into Codex, Claude Code, Cursor, or another coding-agent workflow.

## Do Not Use When
- The task is only to run an eval, read a scorecard, configure the CLI, or author a challenge pack.
- The task changes product docs outside the Agent Skills catalog and does not affect skill discovery.

## Inputs Needed
- Feature area and intended user workflow.
- The upstream skill dependencies that must be read first.
- Source-backed command names, field names, examples, and failure modes.
- Whether the workflow targets hosted production, local development, or a self-hosted backend.

## Canonical Folder Taxonomy
The canonical source is always a `SKILL.md` file under `web/content/agent-skills`.

```text
web/content/agent-skills/SKILL.md
web/content/agent-skills/<top-level-skill>/SKILL.md
web/content/agent-skills/agent-build-skills/<skill>/SKILL.md
web/content/agent-skills/challenge-pack-skills/<skill>/SKILL.md
```

Use top-level folders for cross-cutting workflows such as CLI setup, eval running, scorecard reading, regression, and CI gates. Use `agent-build-skills/` for agent build specs, runtime resources, deployments, providers, secrets, and model aliases. Use `challenge-pack-skills/` for challenge pack planning, YAML authoring, inputs, tools, artifacts, scoring, judges, validation, and publish workflows.

## Required Frontmatter
Every skill must start with YAML frontmatter that the docs generator can parse with `gray-matter`.

```yaml
---
name: agentclash-example-skill
description: Use when the trigger is specific enough that an agent can choose this skill before reading the body.
metadata:
  agentclash.role: example
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---
```

Field rules:
- `name`: stable kebab-case skill identifier, usually matching the folder name.
- `description`: trigger-oriented sentence that starts with "Use when" and names the workflow, not a generic summary.
- `metadata.agentclash.role`: short role label shown in the catalog list.
- `metadata.agentclash.version`: string version for the skill contract.
- `metadata.agentclash.requires_cli`: string `"true"` or `"false"` so installers and reviewers can spot CLI-dependent skills.

## Required Body Sections
Each skill should be useful without reading the AgentClash source code. Include these sections unless the section is not applicable and explicitly say why.

- `Purpose`: one paragraph describing the workflow outcome.
- `Use When`: concrete triggers that should load the skill.
- `Do Not Use When`: nearby workflows that should choose a different skill.
- `Inputs Needed`: information the agent should collect before acting.
- `Environment`: backend defaults, credentials, workspace, and local/self-hosted differences.
- `Procedure`: ordered operating steps.
- `Commands`: copyable commands with placeholders.
- `Expected Output`: what success looks like.
- `Failure Modes`: common errors and recovery steps.
- `Safety Notes`: secrets, destructive actions, cost, publish, or production cautions.
- `Report Back Format`: concise format the agent should use when done.
- `Related Docs`: `/docs-md/...` links that support the workflow.

## Generated Docs Contract
The web docs generator discovers skill files from `web/content/agent-skills/**/SKILL.md`.

- `/docs/agent-skills` and `/docs-md/agent-skills` render the catalog index and this catalog contract.
- `/docs/agent-skills/<skill>` and `/docs-md/agent-skills/<skill>` render individual top-level skills.
- `/docs/agent-skills/<category>/<skill>` and `/docs-md/agent-skills/<category>/<skill>` render nested category skills.
- `/llms.txt` includes the Agent Skills entry and every discovered skill page.
- `/llms-full.txt` includes the Agent Skills catalog, category pages, and full skill bodies.

When adding a new category, update the docs navigation and category map in `web/src/lib/docs.ts` so the category page, markdown path, and bundle order are explicit.

## Hosted Backend Examples
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth login --device
agentclash workspace list
agentclash workspace use <workspace-id>
```

Only use local or self-hosted URLs when the skill is explicitly about local development or deployment:

```bash
agentclash --api-url http://localhost:8080 doctor
```

## Authoring Procedure
1. Pick the folder from the taxonomy.
2. Read the related upstream skills in dependency order.
3. Verify command names, config names, YAML fields, and API behavior from source-backed docs or code.
4. Write trigger-oriented frontmatter and the required body sections.
5. Prefer production hosted examples unless the workflow is local or self-hosted.
6. Add or update docs-generation tests when a new path, category, or export behavior is introduced.
7. Run the docs tests and lint before opening a PR.

## Validation Commands
```bash
cd web
npm test -- docs.test.ts
npm run lint
```

For CLI packaging-related skill changes, also validate from `cli/`:

```bash
go build ./...
go test -short -race -count=1 ./...
```

## Failure Modes
- Missing frontmatter: the generated page may have an empty name or description.
- Vague description: agents may not discover the skill at the right time.
- New category without generator updates: `/docs-md/agent-skills/<category>` may not exist.
- Examples that default to localhost: users may accidentally run against the wrong backend.
- Claims not tied to current docs or code: downstream skills will repeat incorrect fields or commands.

## Safety Notes
- Do not include tokens, workspace secrets, or customer data in examples.
- Do not tell agents to publish, delete, or mutate production resources without an explicit confirmation step.
- Prefer read-only discovery commands before write commands.
- Keep skill instructions portable; avoid relying on a single agent product unless the section is explicitly install-target guidance.

## Dependency Order
Read related skills in this order so downstream workflows do not redefine upstream concepts:

1. `agentclash-skill-catalog`
2. `agentclash-cli-setup`
3. `agentclash-runtime-resources-setup`
4. `agentclash-agent-build-author`
5. `agentclash-agent-deployment-setup`
6. `agentclash-challenge-pack-planner`
7. `agentclash-challenge-pack-yaml-author`
8. `agentclash-challenge-pack-input-sets`
9. `agentclash-challenge-pack-tools-sandbox`
10. `agentclash-challenge-pack-artifacts`
11. `agentclash-challenge-pack-scoring-validators`
12. `agentclash-challenge-pack-llm-judges`
13. `agentclash-challenge-pack-validation-publish`
14. `agentclash-eval-runner`
15. `agentclash-scorecard-reader`
16. `agentclash-regression-flywheel`
17. `agentclash-ci-release-gate`

## Review Checklist
- The folder path matches the taxonomy.
- Frontmatter includes `name`, trigger-oriented `description`, and the three `metadata.agentclash.*` fields.
- Commands and examples use `https://api.agentclash.dev` unless local or self-hosted behavior is explicit.
- The body includes inputs, exact fields or commands, expected outputs, failure modes, safety notes, and report-back format.
- Related docs use `/docs-md/...` links.
- Generated docs, markdown exports, `llms.txt`, and `llms-full.txt` include the updated content.

## Report Back Format
```text
Skill: <name>
Path: web/content/agent-skills/<path>/SKILL.md
Category: <core | agent-build-skills | challenge-pack-skills>
Docs: /docs-md/agent-skills/<path>
Validation: <commands run and result>
Notes: <source-backed caveats or follow-ups>
```

## Related Docs
- `/docs-md/agent-skills`
- `/docs-md/guides/use-with-ai-tools`
- `/docs-md/reference/cli`
- `/docs-md/reference/config`
````

---

# Challenge Pack Skills

Focused skills for planning, authoring, scoring, judging, tooling, artifacts, validation, and publishing challenge packs.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills

Focused skills for planning, authoring, scoring, judging, tooling, artifacts, validation, and publishing challenge packs.

## Skills

- [agentclash-challenge-pack-artifacts](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts) - challenge-pack-artifacts: Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.
- [agentclash-challenge-pack-input-sets](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets) - challenge-pack-inputs: Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
- [agentclash-challenge-pack-llm-judges](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges) - challenge-pack-judging: Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
- [agentclash-challenge-pack-planner](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner) - challenge-pack-planning: Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
- [agentclash-challenge-pack-scoring-validators](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators) - challenge-pack-scoring: Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
- [agentclash-challenge-pack-tools-sandbox](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox) - challenge-pack-tools: Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.
- [agentclash-challenge-pack-validation-publish](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish) - challenge-pack-publication: Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.
- [agentclash-challenge-pack-yaml-author](https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author) - challenge-pack-authoring: Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.

## Canonical Layout

```text
web/content/agent-skills/challenge-pack-skills/<skill>/SKILL.md
```

---

# Agent Build Skills

Focused skills for agent build specs, deployments, runtime resources, provider accounts, model aliases, and secrets.

Source: https://www.agentclash.dev/docs/agent-skills/agent-build-skills
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills

Focused skills for agent build specs, deployments, runtime resources, provider accounts, model aliases, and secrets.

## Skills

- [agentclash-agent-build-author](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author) - agent-builds: Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.
- [agentclash-agent-deployment-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup) - agent-deployments: Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.
- [agentclash-runtime-resources-setup](https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup) - runtime-resources: Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.

## Canonical Layout

```text
web/content/agent-skills/agent-build-skills/<skill>/SKILL.md
```

---

# CLI Setup Skill

Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.

Source: https://www.agentclash.dev/docs/agent-skills/agentclash-cli-setup
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agentclash-cli-setup

Canonical source: `web/content/agent-skills/agentclash-cli-setup/SKILL.md`

Markdown export: `/docs-md/agent-skills/agentclash-cli-setup`

## Use This Skill When

Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.

## Full SKILL.md

````markdown
---
name: agentclash-cli-setup
description: Use when configuring the AgentClash CLI, authenticating with device login or tokens, selecting a workspace, saving default config with link, creating project config with init, resolving API URL precedence, or diagnosing CLI access against production, local, or self-hosted backends.
metadata:
  agentclash.role: setup
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash CLI Setup

## Purpose
Configure an AgentClash CLI session that can reach the intended backend, authenticate safely, resolve the right workspace, and pass `agentclash doctor` before a user starts authoring packs or running evals.

## Use When
- A user asks to install, authenticate, verify, or repair AgentClash CLI access.
- A coding agent needs AgentClash access but does not have the AgentClash source repo.
- A user needs to switch workspaces, save a default workspace, or create project-local `.agentclash.yaml` config.
- A CLI command fails because API URL, token, organization, workspace, or saved config precedence is unclear.
- CI needs a non-interactive setup path with `AGENTCLASH_TOKEN`.

## Do Not Use When
- The CLI is already configured and the task is to run an eval or inspect a scorecard.
- The user needs to author challenge pack YAML; use `agentclash-challenge-pack-yaml-author` after setup passes.
- The user is making a release decision or CI gate from completed runs; use `agentclash-ci-release-gate`.
- The task requires changing AgentClash CLI source code rather than using the CLI.

## Inputs Needed
- Backend target: hosted production, local development, or self-hosted.
- Whether the CLI is installed as `agentclash` or run from source with `go run .` inside `cli/`.
- Whether browser/device login is acceptable, or whether `AGENTCLASH_TOKEN` must be used.
- Organization ID if `workspace list` cannot infer one from saved config.
- Workspace ID, workspace slug/name, or permission to choose one interactively with `agentclash link`.
- Whether the command should mutate saved user config under `~/.config/agentclash/config.yaml`.
- Whether the current repository should get project-local `.agentclash.yaml` config with `agentclash init`.

## Environment
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

Use a local backend only when the user is explicitly running the API server locally:

```bash
export AGENTCLASH_API_URL="http://localhost:8080"
```

For CI or other non-interactive shells, provide a token through the environment instead of running browser login:

```bash
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"
```

Do not print token values in chat, logs, docs, or committed files.

## Config And Precedence
API URL resolution:

```text
--api-url > AGENTCLASH_API_URL > saved user config > default
```

The source-build default is `http://localhost:8080`. Released CLI builds stamp the production default at release time, but skills should still set `AGENTCLASH_API_URL="https://api.agentclash.dev"` explicitly for hosted workflows so copied commands behave the same in source and released builds.

Workspace resolution:

```text
--workspace / -w > AGENTCLASH_WORKSPACE > project .agentclash.yaml > saved user config
```

Organization resolution for commands that need an org:

```text
AGENTCLASH_ORG > project .agentclash.yaml > saved user config
```

Auth token resolution:

```text
AGENTCLASH_TOKEN > stored CLI credentials
```

Saved user config lives at:

```text
~/.config/agentclash/config.yaml
```

Project-local config lives in `.agentclash.yaml` and is discovered by walking up from the current directory. It can store `workspace_id` and `org_id`.

## Procedure
1. Choose the backend. For normal hosted work, export `AGENTCLASH_API_URL="https://api.agentclash.dev"` first.
2. Verify the CLI is callable with `agentclash version` or, from the repo `cli/` directory, `go run . version`.
3. Authenticate. Prefer `agentclash auth login --device` for remote shells; use `AGENTCLASH_TOKEN` for CI.
4. Check the authenticated identity with `agentclash auth status`.
5. Select a workspace with `agentclash link` for the guided flow, or `agentclash workspace use <workspace-id>` when the ID is already known. Both commands write saved user config.
6. If this repository should carry its own workspace binding, run `agentclash init --workspace-id <workspace-id> --org-id <organization-id>` from the project root.
7. Run `agentclash doctor` to verify install, auth, workspace, challenge-pack visibility, deployment visibility, and baseline readiness.
8. Report the effective backend, auth state, workspace, doctor result, and the next command the user should run.

## Commands
Hosted production, interactive setup:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash version
agentclash auth login --device
agentclash auth status
agentclash link
agentclash doctor
```

Hosted production when the workspace ID is already known:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth login --device
agentclash workspace use <workspace-id>
agentclash doctor
```

Workspace discovery when an organization must be explicit:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash workspace list --org <organization-id>
agentclash workspace get <workspace-id>
agentclash workspace use <workspace-id>
```

Project-local config for a repository:

```bash
agentclash init --workspace-id <workspace-id> --org-id <organization-id>
```

CI or non-interactive setup:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"
agentclash doctor --json
```

Local development from the CLI module:

```bash
cd cli
export AGENTCLASH_API_URL="http://localhost:8080"
go run . auth login --device
go run . link
go run . doctor
```

One-off backend override without changing saved config:

```bash
agentclash --api-url http://localhost:8080 doctor
```

## Expected Output
- `auth login --device` prints a verification URL or confirms an existing valid login.
- `auth status` shows the authenticated user and accessible organization/workspace counts.
- `link` saves the selected workspace and organization in user config and prints the next suggested command. It does not write `.agentclash.yaml`.
- `workspace use <workspace-id>` validates access and saves `default_workspace`; it also saves `default_org` when the workspace details include an organization ID.
- `init --workspace-id <workspace-id> --org-id <organization-id>` writes project-local `.agentclash.yaml` in the current directory.
- `doctor` prints the effective API URL, workspace, setup checks, and suggested next steps. In JSON mode, `ready: true` means no `warn` or `fail` checks remain.

## Failure Modes
- `AGENTCLASH_TOKEN is set but could not be validated`: the environment token is present and takes precedence, but it is wrong for this backend. Unset or replace `AGENTCLASH_TOKEN`, then rerun `agentclash auth login --device` or `agentclash auth status`.
- `Not logged in. No API token is configured.`: run `agentclash auth login --device` for interactive work, or set `AGENTCLASH_TOKEN` for CI.
- Commands unexpectedly hit `http://localhost:8080`: set `AGENTCLASH_API_URL="https://api.agentclash.dev"` or pass `--api-url`; source builds default to localhost.
- `organization ID required`: pass `--org <organization-id>` or set a default organization through `agentclash link`, `agentclash workspace use`, or config.
- `no workspace specified`: run `agentclash link`, pass `--workspace <workspace-id>`, set `AGENTCLASH_WORKSPACE`, or create project config with `agentclash init`.
- `Workspace <id> is not accessible`: the saved or env workspace does not belong to the current token/backend. Run `agentclash link` or update `AGENTCLASH_WORKSPACE`.
- `doctor` reports no challenge packs: setup is valid, but the workspace needs `agentclash challenge-pack init`, `validate`, and `publish` before evals are useful.
- `doctor` reports no deployments: setup is valid, but an agent deployment must be created before starting evals.
- `doctor` reports no baseline: this is advisory on a fresh workspace; set one after the first completed eval with `agentclash baseline set`.

## Safety Notes
- Never paste, print, or commit `AGENTCLASH_TOKEN` or provider secrets.
- Ask before changing saved user config when the user is sharing a machine or switching production workspaces.
- Prefer `--api-url` for one-off local/self-hosted checks so saved production config is not accidentally changed.
- Prefer `doctor --json` in automation because it returns a machine-readable `ready` field and exits non-zero when setup warnings remain.
- Run read-only commands (`auth status`, `workspace list`, `doctor`) before write or publish commands.

## Report Back Format
```text
Backend: <effective API URL>
Auth: <ok | action needed> (<source: AGENTCLASH_TOKEN | stored credentials | none>)
Workspace: <workspace-id or none>
Doctor: <ready | warnings> - <short check summary>
Next command: <single recommended command>
Notes: <config precedence, local override, or token caveat if relevant>
```

## Related Docs
- `/docs-md/getting-started/quickstart`
- `/docs-md/guides/use-with-ai-tools`
- `/docs-md/reference/cli`
- `/docs-md/reference/config`
- `/docs-md/agent-skills`
````

---

# Eval Runner Skill

Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.

Source: https://www.agentclash.dev/docs/agent-skills/agentclash-eval-runner
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agentclash-eval-runner

Canonical source: `web/content/agent-skills/agentclash-eval-runner/SKILL.md`

Markdown export: `/docs-md/agent-skills/agentclash-eval-runner`

## Use This Skill When

Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.

## Full SKILL.md

````markdown
---
name: agentclash-eval-runner
description: Use when starting, following, inspecting, or reporting AgentClash eval runs with the CLI, especially eval start, run create, deployment selection, input set selection, suite-only scopes, repetitions, events, rankings, failures, and scorecards.
metadata:
  agentclash.role: running
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Eval Runner

## Purpose
Create an AgentClash eval run or eval session against a published challenge pack, follow it when useful, inspect evidence after it runs, and report stable commands a reviewer can repeat.

## Use When
- A user asks to run one or more agent deployments against a published challenge pack.
- A user wants to choose a challenge pack, version, input set, deployment, regression suite, or run scope from the CLI.
- A user wants live run events, rankings, failures, agents, or scorecards after a run starts.
- A CI or local workflow needs exact non-interactive commands.

## Do Not Use When
- The challenge pack is not authored or published yet; use the challenge-pack skills first.
- The user needs to create deployments or runtime resources; use `agentclash-agent-deployment-setup` or `agentclash-runtime-resources-setup`.
- The task is only to interpret an already generated scorecard in depth; use `agentclash-scorecard-reader`.
- The task is a release gate or CI manifest workflow; use `agentclash-ci-release-gate`.

## Inputs Needed
- Workspace ID or configured default workspace.
- Challenge pack selector: pack ID, slug, exact name, or challenge pack version ID.
- Challenge pack version selector: version ID or version number.
- Input set selector: input set ID, key, or exact name.
- Agent deployment selectors: deployment IDs or exact names.
- Scope: `full` or `suite_only`.
- Optional regression suite IDs/names or regression case IDs.
- Whether to stream events with `--follow`.
- Whether this is a repeated eval session with `--repetitions`.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

Before creating a run, verify auth and workspace context:

```bash
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
agentclash challenge-pack list --json
agentclash deployment list --json
```

Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`. `eval start`, `run create`, `run list`, `run failures`, and `eval scorecard` require a workspace.

## Prefer `eval start` for Humans and Agents
`agentclash eval start` wraps `agentclash run create` but resolves selectors through workspace reads. Use it when names, slugs, input-set keys, or guided selection are useful.

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG_OR_EXACT_NAME> \
  --pack-version <VERSION_ID_OR_VERSION_NUMBER> \
  --input-set <INPUT_SET_ID_OR_KEY_OR_EXACT_NAME> \
  --deployment <DEPLOYMENT_ID_OR_EXACT_NAME> \
  --name "Smoke eval" \
  --follow
```

Exact `eval start` flags:

- `--pack`: challenge pack ID, slug, or exact name.
- `--pack-version`: challenge pack version ID or version number. Use `--pack` when selecting by version number; a version ID can identify the pack by itself.
- `--input-set`: challenge input set ID, key, or exact name.
- `--deployment`: deployment ID or exact name. Repeat this flag for multiple deployments.
- `--name`: optional run name.
- `--follow`: stream run events after creation.
- `--scope`: `full` or `suite_only`; default is `full`.
- `--suite`: regression suite ID or exact name. Repeatable.
- `--case`: regression case IDs. Repeatable.
- `--race-context`: enable live peer-standings injection during the run.
- `--race-context-cadence`: 0 for backend default, otherwise 1 through 10.
- `--repetitions`: repeat the eval 1 through 100 times; values 2 or greater use `/v1/eval-sessions`.

Selector behavior:

- Pack selectors match ID, slug, or exact name.
- Deployment selectors match ID or exact name.
- Suite selectors match ID or exact name, are filtered to the selected pack when possible, and must resolve to active suites.
- Input set selectors match ID, input key, or exact name.
- Selectors are exact or case-insensitive exact matches, not substring search.
- If no pack is specified and there is one pack, the CLI uses it; with multiple packs in non-interactive mode, pass `--pack` or `--pack-version`.
- If no version is specified, the CLI uses the highest `version_number` for the selected pack.
- If a version has no input sets, the CLI submits without `challenge_input_set_id`.
- If a version has one input set, the CLI uses it.
- If a version has multiple input sets in non-interactive mode, pass `--input-set`.
- If multiple deployments exist in non-interactive mode, pass at least one `--deployment`.

## Use `run create` for ID-First Automation
`agentclash run create` posts directly to `/v1/runs`. Use it when a script already has IDs.

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --input-set <CHALLENGE_INPUT_SET_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --name "Smoke eval" \
  --follow
```

Exact `run create` notes:

- The lower-level flag is plural `--deployments`; it expects deployment IDs.
- `--challenge-pack-version` expects a challenge pack version ID.
- `--input-set` expects a challenge input set ID.
- In non-interactive mode, `--challenge-pack-version` and `--deployments` are required.
- In a TTY, missing challenge pack version, input set, or deployments can open pickers.
- `run create` does not resolve pack slugs, input set keys, or deployment names. Use `eval start` for that.
- `--scope`, `--suite`, `--case`, `--race-context`, and `--race-context-cadence` behave like `eval start`, but suite and case flags are ID-first.

The run create request body sent by the CLI contains:

```json
{
  "workspace_id": "<WORKSPACE_ID>",
  "challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "challenge_input_set_id": "<CHALLENGE_INPUT_SET_ID>",
  "agent_deployment_ids": ["<AGENT_DEPLOYMENT_ID>"],
  "official_pack_mode": "full",
  "name": "Smoke eval",
  "regression_suite_ids": ["<REGRESSION_SUITE_ID>"],
  "regression_case_ids": ["<REGRESSION_CASE_ID>"],
  "race_context": true,
  "race_context_min_step_gap": 3
}
```

Optional fields are omitted when not set. The create-run API requires JSON, caps the body at 1 MiB, rejects unknown JSON fields, and returns:

```json
{
  "id": "<RUN_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "challenge_input_set_id": "<CHALLENGE_INPUT_SET_ID>",
  "official_pack_mode": "full",
  "status": "queued",
  "execution_mode": "single_agent",
  "created_at": "<timestamp>",
  "queued_at": "<timestamp>",
  "race_context": false,
  "links": {
    "self": "/v1/runs/<RUN_ID>",
    "agents": "/v1/runs/<RUN_ID>/agents"
  }
}
```

## Suite-Only Runs
Use suite-only scope when you want to run only selected regression suites or cases.

With `eval start`, suites can be IDs or exact names:

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG> \
  --pack-version <VERSION_ID_OR_NUMBER> \
  --deployment <DEPLOYMENT_ID_OR_NAME> \
  --scope suite_only \
  --suite <REGRESSION_SUITE_ID_OR_EXACT_NAME> \
  --follow
```

With `run create`, use IDs:

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --scope suite_only \
  --suite <REGRESSION_SUITE_ID> \
  --follow
```

`--scope suite_only` requires at least one `--suite` or `--case`.

## Repeated Eval Sessions
Use `--repetitions` on `eval start` for repeated runs of the same eval.

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG> \
  --pack-version <VERSION_ID_OR_NUMBER> \
  --input-set <INPUT_SET_ID_OR_KEY> \
  --deployment <DEPLOYMENT_ID_OR_NAME> \
  --repetitions 3 \
  --json
```

Exact repetition behavior:

- `--repetitions` must be between 1 and 100.
- `--repetitions 1` creates a normal run through `/v1/runs`.
- `--repetitions >= 2` posts to `/v1/eval-sessions`.
- `--follow` is not supported with `--repetitions >= 2`; tail individual child runs with `agentclash run events <RUN_ID>`.
- `--scope suite_only`, `--suite`, `--case`, and race-context flags are not supported with `--repetitions >= 2`.
- The eval-session response is `{ "eval_session": {...}, "run_ids": [...] }`.

In human output, the CLI prints eval session ID, status, repetitions, and child run IDs. In structured output, it prints the raw response envelope.

## Follow and Events
Use `--follow` for interactive runs when you want immediate event visibility.

```bash
agentclash eval start ... --follow
agentclash run create ... --follow
agentclash run events <RUN_ID>
```

`run events` streams `/v1/runs/<runID>/events/stream` via SSE.

- In structured output mode (`--json` or `--output yaml`), `eval start --follow` and `run create --follow` print the created run and do not stream events. Use `agentclash run events <RUN_ID> --json` or `--output yaml` for structured event streams.
- Human output prints timestamped event summaries.
- `--json` prints one NDJSON event payload per line.
- `--output yaml` prints a YAML multi-document stream.
- Press Ctrl+C to stop an event stream.

## Inspect After Creation
Use these read commands after a run is created:

```bash
agentclash run list --json
agentclash run get <RUN_ID> --json
agentclash run agents <RUN_ID> --json
agentclash run ranking <RUN_ID> --json
agentclash run ranking <RUN_ID> --sort-by composite
agentclash run failures <RUN_ID> --json
agentclash eval scorecard <RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run scorecard <RUN_AGENT_ID> --json
```

Read command notes:

- `run list` lists runs in the workspace.
- `run get` reads `/v1/runs/<id>`.
- `run agents` lists run agents and labels.
- `run ranking --sort-by` accepts `composite`, `correctness`, `reliability`, `latency`, or `cost`.
- `run failures` accepts `--agent`, `--severity`, `--class`, `--evidence-tier`, `--cluster`, `--cursor`, and `--limit`.
- `run scorecard` expects a run agent ID.
- `eval scorecard [run]` is run-first. If run is omitted, it selects the latest workspace run; with multiple run agents, pass `--agent` in non-interactive mode.
- `eval scorecard --json` returns an envelope with `candidate`, `baseline`, `scorecard`, `comparison`, and `release_gate`.
- If scorecard generation is pending, stateful scorecard reads can return a pending payload instead of a final scorecard.

## Common Failure Modes
- No workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- No challenge packs: publish a pack first with `agentclash-challenge-pack-validation-publish`.
- Multiple packs in non-interactive `eval start`: pass `--pack` or a version ID through `--pack-version`.
- Version number without pack: pass `--pack` as well, because a bare version number cannot identify a pack.
- Multiple input sets: pass `--input-set`; `eval start` can use ID/key/exact name, while `run create` expects ID.
- Multiple deployments: pass one or more `--deployment` flags for `eval start`, or `--deployments` IDs for `run create`.
- `missing_challenge_input_set_id`: the selected pack version has multiple input sets and no input set ID was submitted.
- `invalid_agent_deployment_ids`: deployment IDs must be active deployments with snapshots in the selected workspace, with no duplicates.
- `invalid_challenge_pack_version_id`: the version must be runnable and visible to the selected workspace.
- `invalid_challenge_input_set_id`: the input set must belong to the selected challenge pack version.
- `invalid_race_context`: race context requires at least two agents.
- `--race-context-cadence must be 0 (backend default) or between 1 and 10`: fix the cadence value.
- `--follow is not supported with --repetitions >= 2`: create the eval session, then stream individual child runs with `run events`.
- `--scope suite_only requires at least one --suite or --case`: add a suite or case selection.
- Scorecard pending or errored: report the state, then collect `run events`, `run agents`, and `run failures`.

## Safety Notes
- Creating runs can spend provider budget and may execute tools or network access allowed by the deployment/runtime profile.
- Confirm before running production-scale, multi-deployment, high-repetition, or network-enabled evals.
- Prefer small input sets and `--scope suite_only` for smoke checks.
- Do not paste secrets from run events, scorecards, failures, artifacts, or logs into chat.
- Use `--json` for automation and save run IDs before starting follow streams.

## Report Back Format
```text
Workspace: <workspace-id>
Command used:
Run ID: <id or none>
Eval session ID: <id or none>
Child run IDs: <ids if repetitions >= 2>
Challenge pack version: <id>
Input set: <id/key/name or none>
Deployments:
- <id/name>
Scope: <full|suite_only>
Followed: <yes/no>
Status: <queued/running/completed/failed/etc>
Agents: <count and labels>
Ranking: <summary or unavailable>
Failures: <count/filter summary or unavailable>
Scorecard: <state/link/summary or unavailable>
Evidence commands:
- agentclash run get <RUN_ID> --json
- agentclash run agents <RUN_ID> --json
- agentclash run ranking <RUN_ID> --json
- agentclash run failures <RUN_ID> --json
Next action: <recommendation>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-agent-deployment-setup`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-scorecard-reader`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/getting-started/first-eval`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/reference/cli`
````

---

# Scorecard Reader Skill

Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.

Source: https://www.agentclash.dev/docs/agent-skills/agentclash-scorecard-reader
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agentclash-scorecard-reader

Canonical source: `web/content/agent-skills/agentclash-scorecard-reader/SKILL.md`

Markdown export: `/docs-md/agent-skills/agentclash-scorecard-reader`

## Use This Skill When

Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.

## Full SKILL.md

````markdown
---
name: agentclash-scorecard-reader
description: Use when interpreting AgentClash rankings, scorecards, replay timelines, artifacts, LLM judge results, or failure-review evidence into source-backed findings and next actions.
metadata:
  agentclash.role: reviewing
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Scorecard Reader

## Purpose
Turn completed or inspectable AgentClash run evidence into an engineering readout: who won, why, which claims are backed by scorecard/replay/artifact evidence, and what the next command or fix should be.

## Use When
- A user asks why an AgentClash run passed, failed, regressed, drifted, or picked a winner.
- You have a run ID and need to inspect rankings, run agents, scorecards, replay steps, failure-review items, or artifacts.
- A reviewer needs evidence-first findings instead of raw JSON dumps.
- A follow-up skill needs a grounded summary before promoting regressions or changing a challenge pack.

## Do Not Use When
- The user needs to start a new eval run; use `agentclash-eval-runner`.
- The user needs to author, validate, or publish the challenge pack first; use the challenge-pack skills.
- The user is ready to promote failures into regression suites; use `agentclash-regression-flywheel` after this skill identifies the useful failures.

## Inputs Needed
- Workspace ID or configured workspace context.
- Run ID.
- Optional run agent ID or agent label for a specific scorecard or replay.
- Optional baseline expectation, expected winner, or release-gate decision to compare against.
- Optional artifact IDs from failure evidence, scorecards, or workspace artifact list.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
```

Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`. `run failures`, `artifact list`, and `eval scorecard` require a workspace. `artifact download` uses an artifact ID directly.

## Procedure
1. Confirm the run exists and get its agent IDs.
2. Read the ranking to identify the winner, sort mode, gaps, unavailable scores, and evidence warnings.
3. Read the relevant scorecard. Use `eval scorecard` for run-first analysis and baseline comparison; use `run scorecard` when you already have a run agent ID.
4. Read failure-review items for concrete failed checks, replay refs, artifact refs, judge refs, metric refs, severity, and promotability.
5. Pull replay steps around referenced sequences, or page through replay when no refs exist.
6. Inspect artifacts only when a scorecard, failure item, or user request points to them.
7. Report claims as evidence-first findings: claim, evidence pointer, impact, next action.

## Commands
Start with run-level shape:

```bash
agentclash run get <RUN_ID> --json
agentclash run agents <RUN_ID> --json
agentclash run ranking <RUN_ID> --json
agentclash run ranking <RUN_ID> --sort-by composite --json
agentclash run ranking <RUN_ID> --sort-by correctness --json
```

Scorecard commands:

```bash
agentclash eval scorecard <RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash eval scorecard --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run scorecard <RUN_AGENT_ID> --json
```

Replay and failure-review commands:

```bash
agentclash run failures <RUN_ID> --json
agentclash run failures <RUN_ID> --agent <RUN_AGENT_ID> --severity blocking --json
agentclash run failures <RUN_ID> --class policy_violation --evidence-tier hosted_structured --json
agentclash run failures <RUN_ID> --cluster <FAILURE_CLUSTER_KEY> --limit 50 --json
agentclash replay get <RUN_AGENT_ID> --limit 50 --json
agentclash replay get <RUN_AGENT_ID> --cursor 50 --limit 50 --json
```

Artifact commands:

```bash
agentclash artifact list --json
agentclash artifact download <ARTIFACT_ID> --output <PATH>
```

Important exact CLI shapes:

- `run scorecard` takes one argument: `<RUN_AGENT_ID>`. It does not accept `<RUN_ID> <RUN_AGENT_ID>`.
- `replay get` takes one argument: `<RUN_AGENT_ID>`. It does not accept `<RUN_ID> <RUN_AGENT_ID>`.
- `artifact list` is workspace-wide. It does not have a `--run` filter today; use `artifact_refs`, `run_id`, or `run_agent_id` fields from JSON output to choose what to download.
- `run ranking --sort-by` commonly uses `composite`, `correctness`, `reliability`, `latency`, or `cost`. The backend also accepts a custom dimension key when that key exists in the scorecard dimensions; unknown keys return `invalid_sort_by`.
- `run failures --limit` defaults to 50 when omitted and is capped at 200 by the API.

## Ranking JSON
`agentclash run ranking <RUN_ID> --json` returns a stateful response:

```json
{
  "state": "ready",
  "ranking": {
    "run_id": "<RUN_ID>",
    "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
    "sort": {
      "field": "correctness_then_reliability",
      "direction": "desc",
      "default_order": true
    },
    "winner": {
      "run_agent_id": "<RUN_AGENT_ID>",
      "strategy": "<strategy>",
      "status": "<status>",
      "reason_code": "<reason_code>"
    },
    "evidence_quality": {
      "missing_fields": [],
      "warnings": []
    },
    "items": [
      {
        "rank": 1,
        "run_agent_id": "<RUN_AGENT_ID>",
        "lane_index": 0,
        "label": "<agent label>",
        "status": "completed",
        "has_scorecard": true,
        "evaluation_status": "complete",
        "sort_value": 0.92,
        "delta_from_top": 0,
        "sort_state": "available",
        "strategy": "<strategy>",
        "passed": true,
        "overall_reason": "<reason>",
        "composite_score": 0.91,
        "overall_score": 0.91,
        "correctness_score": 0.95,
        "reliability_score": 0.9,
        "latency_score": 0.8,
        "cost_score": 0.7,
        "dimensions": {
          "correctness": {
            "state": "available",
            "score": 0.95,
            "better_direction": "higher"
          }
        }
      }
    ]
  }
}
```

Read `evidence_quality.warnings` before declaring a winner as conclusive. A low `sort_value`, missing `rank`, `sort_state: "unavailable"`, `has_scorecard: false`, or missing score fields means the ranking may be partial even if the run itself completed.

## Scorecard JSON
Use `agentclash run scorecard <RUN_AGENT_ID> --json` when you already know the agent ID. The top level mirrors `/v1/scorecards/{runAgentID}`:

```json
{
  "state": "ready",
  "run_agent_status": "completed",
  "id": "<SCORECARD_ID>",
  "run_agent_id": "<RUN_AGENT_ID>",
  "run_id": "<RUN_ID>",
  "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
  "overall_score": 0.91,
  "correctness_score": 0.95,
  "reliability_score": 0.9,
  "latency_score": 0.8,
  "cost_score": 0.7,
  "behavioral_score": 0.85,
  "llm_judge_results": [
    {
      "id": "<JUDGE_RESULT_ID>",
      "judge_key": "<judge_key>",
      "mode": "<mode>",
      "normalized_score": 0.8,
      "confidence": "medium",
      "variance": 0.02,
      "sample_count": 3,
      "model_count": 1,
      "payload": {},
      "created_at": "<timestamp>",
      "updated_at": "<timestamp>"
    }
  ],
  "scorecard": {
    "run_agent_id": "<RUN_AGENT_ID>",
    "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
    "status": "complete",
    "strategy": "<strategy>",
    "overall_score": 0.91,
    "passed": true,
    "overall_reason": "<reason>",
    "warnings": [],
    "dimensions": {
      "correctness": {
        "state": "available",
        "score": 0.95,
        "reason": "<reason>",
        "weight": 1,
        "gate": true,
        "pass_threshold": 0.8,
        "contribution": 0.95,
        "gate_passed": true
      }
    },
    "validator_summary": {},
    "validator_details": [
      {
        "key": "<validator_key>",
        "type": "<validator_type>",
        "verdict": "pass",
        "state": "complete",
        "reason": "<reason>",
        "normalized_score": 1,
        "source": {
          "kind": "final_output",
          "sequence": 12,
          "event_type": "<event type>",
          "field_path": "<field path>"
        }
      }
    ],
    "metric_summary": {},
    "metric_details": [
      {
        "key": "<metric_key>",
        "collector": "<collector>",
        "state": "available",
        "numeric_value": 123
      }
    ]
  },
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

Read the nested `scorecard.dimensions` first, then inspect `validator_details`, `metric_details`, and `llm_judge_results` for supporting evidence. Treat `llm_judge_results.payload` as judge-specific raw data; do not invent a stable schema inside it.

## Eval Scorecard Envelope
`agentclash eval scorecard [RUN_ID] --agent <RUN_AGENT_ID_OR_LABEL> --json` is run-first. If `RUN_ID` is omitted, the CLI selects the latest run in the workspace. If the run has multiple agents in non-interactive mode, pass `--agent`.

Structured output is an envelope:

```json
{
  "candidate": {
    "workspace_id": "<WORKSPACE_ID>",
    "run_id": "<RUN_ID>",
    "run_name": "<name>",
    "run_status": "completed",
    "run_agent_id": "<RUN_AGENT_ID>",
    "run_agent_label": "<label>",
    "official_pack_mode": "full"
  },
  "baseline": null,
  "scorecard": {},
  "comparison": null,
  "release_gate": null
}
```

When a baseline bookmark exists, the CLI also fetches `/v1/compare` and `/v1/release-gates/evaluate`, then fills `baseline`, `comparison`, and `release_gate`. Use this envelope for regression-style summaries; use `run scorecard` for the raw per-agent scorecard only.

## Replay JSON
`agentclash replay get <RUN_AGENT_ID> --json` returns replay state, optional replay metadata, steps, and pagination:

```json
{
  "state": "ready",
  "run_agent_id": "<RUN_AGENT_ID>",
  "run_id": "<RUN_ID>",
  "run_agent_status": "completed",
  "replay": {
    "id": "<REPLAY_ID>",
    "artifact_id": "<ARTIFACT_ID>",
    "summary": {},
    "latest_sequence_number": 12,
    "event_count": 42,
    "created_at": "<timestamp>",
    "updated_at": "<timestamp>"
  },
  "steps": [
    {
      "sequence_number": 12,
      "step_type": "<step type>",
      "summary": "<summary>"
    }
  ],
  "pagination": {
    "next_cursor": "50",
    "limit": 50,
    "total_steps": 120,
    "has_more": true
  }
}
```

Use replay to verify whether scorecard and judge claims match observable behavior. Prefer referenced `sequence_number` values from failure items or validator sources before paging through the whole replay.

## Failure Review JSON
`agentclash run failures <RUN_ID> --json` returns:

```json
{
  "items": [
    {
      "run_id": "<RUN_ID>",
      "run_agent_id": "<RUN_AGENT_ID>",
      "challenge_identity_id": "<CHALLENGE_ID>",
      "challenge_key": "<challenge_key>",
      "case_key": "<case_key>",
      "item_key": "<item_key>",
      "failure_fingerprint": "frf_...",
      "failure_cluster_key": "frc_...",
      "failure_state": "failed",
      "failed_dimensions": ["correctness"],
      "failed_checks": ["<validator_or_judge_key>"],
      "failure_class": "policy_violation",
      "headline": "<headline>",
      "detail": "<detail>",
      "recommended_action": "<recommended action>",
      "promotable": true,
      "promotion_mode_available": ["full_executable", "output_only"],
      "replay_step_refs": [
        {
          "sequence_number": 12,
          "event_type": "<event type>",
          "kind": "<kind>"
        }
      ],
      "artifact_refs": [
        {
          "key": "<artifact key>",
          "kind": "<kind>",
          "path": "<path>",
          "media_type": "<media type>"
        }
      ],
      "judge_refs": [
        {
          "key": "<judge key>",
          "kind": "llm_judge",
          "state": "fail",
          "normalized_score": 0.2,
          "reason": "<reason>",
          "sequence_number": 12,
          "event_type": "<event type>"
        }
      ],
      "metric_refs": [
        {
          "key": "<metric key>",
          "metric_type": "<type>",
          "state": "available",
          "numeric_value": 123
        }
      ],
      "evidence_tier": "hosted_structured",
      "severity": "blocking"
    }
  ],
  "clusters": [
    {
      "failure_cluster_key": "frc_...",
      "representative_failure_fingerprint": "frf_...",
      "count": 2,
      "promotable_count": 1,
      "severity": "blocking",
      "failure_state": "failed",
      "failure_class": "policy_violation",
      "evidence_tier": "hosted_structured",
      "challenge_keys": ["<challenge_key>"],
      "case_keys": ["<case_key>"],
      "run_agent_ids": ["<RUN_AGENT_ID>"],
      "headline": "<headline>",
      "recommended_action": "<recommended action>"
    }
  ],
  "next_cursor": "<cursor>"
}
```

Filters supported by the CLI:

- `--agent <RUN_AGENT_ID>`
- `--severity info|warning|blocking`
- `--class <failure_class>`
- `--evidence-tier none|native_structured|hosted_structured|hosted_black_box|derived_summary`
- `--cluster <FAILURE_CLUSTER_KEY>`
- `--cursor <NEXT_CURSOR>`
- `--limit <COUNT>`

Failure classes currently accepted by the API are `incorrect_final_output`, `tool_selection_error`, `tool_argument_error`, `retrieval_grounding_failure`, `policy_violation`, `timeout_or_budget_exhaustion`, `sandbox_failure`, `dependency_resolution_failure`, `malformed_output`, `flaky_non_deterministic`, `insufficient_evidence`, and `other`.

## Stateful Reads and Exit Codes
Ranking, scorecard, and replay reads can be ready, pending, or errored.

- Pending responses use HTTP 202 and include `state: "pending"` plus `message`. The CLI prints the raw payload in structured mode and exits successfully.
- Errored responses use HTTP 409 and include `state: "errored"` plus `message`. The CLI prints the raw payload in structured mode and exits with code 1.
- Common messages include `ranking is not ready yet`, `scorecard generation is pending`, `scorecard generation failed or scorecard data is unavailable`, and `replay generation is pending`.

When a read is pending, do not infer pass/fail. Re-check later or inspect `run get`, `run agents`, and `run events`. When a read is errored, report the state and switch to available run, failure, event, and artifact evidence.

## Evidence-First Interpretation
Use this ordering when writing findings:

1. Ranking: winner, sort field, score gap, evidence warnings, unavailable scores.
2. Scorecard dimensions: failed gates, low scores, unavailable/error dimensions, `overall_reason`, warnings.
3. Validator and metric details: exact failed checks, source refs, numeric values, reasons.
4. LLM judge results: judge key, mode, normalized score, confidence, variance, sample/model counts, concise rationale if present in payload.
5. Failure review: `failure_state`, `failure_class`, `severity`, `evidence_tier`, refs, `recommended_action`, promotability.
6. Replay: sequence refs and behavior observed around final output, tool calls, sandbox failures, or malformed outputs.
7. Artifacts: downloaded only when needed to verify file output or user-visible content.

Do not say "the judge proved" something. Say "the scorecard/judge reports X, and replay/artifact evidence Y supports or contradicts it."

## Expected Output
- A winner or status summary that names the run and agent IDs.
- A short list of findings with evidence pointers to scorecard fields, failure fingerprints or clusters, replay sequence numbers, and artifact IDs or paths.
- A distinction between confirmed evidence, judge rationale, unavailable data, and pending/errored reads.
- Follow-up commands a reviewer can run exactly.

## Failure Modes
- Missing workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- Multiple run agents for `eval scorecard`: pass `--agent <RUN_AGENT_ID_OR_LABEL>`.
- Ranking pending: wait for scoring or inspect `run get`, `run agents`, and `run events`.
- Scorecard pending: wait for scorecard generation; do not report pass/fail yet.
- Scorecard errored: the run agent may have failed, or scorecard data may be unavailable. Use failures, replay, events, and artifacts instead.
- Replay pending or noisy: use failure `replay_step_refs` or validator `source.sequence` before paging broadly.
- `artifact list --run` fails: the command does not exist. Use workspace-wide `artifact list --json` or artifact refs from failure/scorecard evidence.
- Unknown `--sort-by`: use a built-in sort field or a dimension key that exists in the scorecard.

## Safety Notes
- Do not paste secrets, private artifact contents, raw provider keys, customer data, or long logs into chat.
- Do not overstate LLM judge rationale as ground truth.
- Download artifacts only when needed, and prefer targeted artifact IDs from evidence refs.
- Scorecards and failures can contain model outputs and tool traces. Quote only the minimum needed to support the finding.
- Read commands are safe, but follow-up mutation commands such as failure promotion belong to `agentclash-regression-flywheel` and should be intentional.

## Report Back Format
```text
Outcome: <winner/status/pending/errored>
Run: <RUN_ID>
Agent(s): <RUN_AGENT_ID or labels>
Evidence:
- <claim> | scorecard=<field/path> | replay=<sequence or none> | artifact=<id/path or none>
Findings:
- <impact> -> <next action>
Uncertainties:
- <pending/errored/unavailable evidence, or none>
Follow-up commands:
- agentclash run ranking <RUN_ID> --json
- agentclash run failures <RUN_ID> --json
- agentclash run scorecard <RUN_AGENT_ID> --json
- agentclash replay get <RUN_AGENT_ID> --limit 50 --json
Next skill: <agentclash-regression-flywheel | challenge-pack skill | none>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/concepts/artifacts`
- `/docs-md/reference/cli`
````

---

# Regression Flywheel Skill

Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.

Source: https://www.agentclash.dev/docs/agent-skills/agentclash-regression-flywheel
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agentclash-regression-flywheel

Canonical source: `web/content/agent-skills/agentclash-regression-flywheel/SKILL.md`

Markdown export: `/docs-md/agent-skills/agentclash-regression-flywheel`

## Use This Skill When

Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.

## Full SKILL.md

````markdown
---
name: agentclash-regression-flywheel
description: Use when inspecting AgentClash run failure-review items, promoting useful failures into regression suites, editing regression suites or cases, and verifying suite-only reruns.
metadata:
  agentclash.role: regression
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Regression Flywheel

## Purpose
Turn understood AgentClash failures into durable regression coverage, then verify the promoted cases with suite-only runs.

## Use When
- A user wants to inspect failure-review items and decide which failures should become regression cases.
- A failure item has `promotable: true` and a useful `promotion_mode_available`.
- A regression suite needs to be created, renamed, archived, reactivated, or used for verification.
- A regression case needs title, description, status, or severity cleanup after promotion.
- A fix needs to be checked against a targeted regression suite or case.

## Do Not Use When
- The run has not produced failure evidence yet; use `agentclash-eval-runner` to run or follow it.
- The user only needs to interpret a scorecard, replay, artifact, or ranking; use `agentclash-scorecard-reader` first.
- The challenge pack itself needs authoring, validation, or publishing; use the challenge-pack skills.
- The task is to configure release gates or CI promotion policy; use `agentclash-ci-release-gate`.

## Inputs Needed
- Workspace ID or configured workspace context.
- Run ID containing failure-review items.
- Source challenge pack ID for the target regression suite.
- Target suite ID, or the suite name/details needed to create one.
- Failure selector: `challenge_identity_id` from `run failures --json`, plus `run_agent_id` when more than one agent failed the same challenge.
- Promotion mode from the failure item's `promotion_mode_available`: `full_executable` or `output_only`.
- Case title, optional failure summary, optional severity, and any validator overrides.
- Deployment and challenge pack version IDs/selectors for a suite-only verification run.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
```

All commands in this skill require workspace context. Workspace resolution follows the CLI setup rules: `--workspace`, `AGENTCLASH_WORKSPACE`, saved config, or `.agentclash.yaml`.

## Procedure
1. Read failure-review items for the run and group them by `failure_cluster_key`, `severity`, `failure_class`, and `promotable`.
2. Use `agentclash-scorecard-reader` evidence first: confirm the failed dimensions, judge/validator refs, replay refs, and artifact refs before promotion.
3. Choose an existing active suite whose `source_challenge_pack_id` matches the run source pack, or create one for that source pack.
4. Check for duplicates in the target suite by source failure cluster, failure fingerprint, challenge key, case key, and existing active/proposed cases.
5. Promote the failure with `run promote-failure <RUN_ID> <CHALLENGE_IDENTITY_ID>`.
6. Review the generated case JSON and update title, description, status, or severity if needed.
7. Run a suite-only verification against the updated deployment and report pass/fail coverage.

## Inspect Failures
Start with failure-review items:

```bash
agentclash run failures <RUN_ID> --json
agentclash run failures <RUN_ID> --agent <RUN_AGENT_ID> --json
agentclash run failures <RUN_ID> --severity blocking --json
agentclash run failures <RUN_ID> --class policy_violation --json
agentclash run failures <RUN_ID> --cluster <FAILURE_CLUSTER_KEY> --limit 50 --json
```

Supported filters are:

- `--agent <RUN_AGENT_ID>`
- `--severity info|warning|blocking`
- `--class <failure_class>`
- `--evidence-tier none|native_structured|hosted_structured|hosted_black_box|derived_summary`
- `--cluster <FAILURE_CLUSTER_KEY>`
- `--cursor <NEXT_CURSOR>`
- `--limit <COUNT>`

Failure classes currently accepted by the API are `incorrect_final_output`, `tool_selection_error`, `tool_argument_error`, `retrieval_grounding_failure`, `policy_violation`, `timeout_or_budget_exhaustion`, `sandbox_failure`, `dependency_resolution_failure`, `malformed_output`, `flaky_non_deterministic`, `insufficient_evidence`, and `other`.

The fields that matter for promotion are:

```json
{
  "items": [
    {
      "run_id": "<RUN_ID>",
      "run_agent_id": "<RUN_AGENT_ID>",
      "challenge_identity_id": "<CHALLENGE_IDENTITY_ID>",
      "challenge_key": "<challenge_key>",
      "case_key": "<case_key>",
      "item_key": "<item_key>",
      "failure_fingerprint": "frf_...",
      "failure_cluster_key": "frc_...",
      "failure_state": "failed",
      "failed_dimensions": ["correctness"],
      "failed_checks": ["<validator_or_judge_key>"],
      "failure_class": "policy_violation",
      "headline": "<headline>",
      "detail": "<detail>",
      "recommended_action": "<recommended action>",
      "promotable": true,
      "promotion_mode_available": ["full_executable", "output_only"],
      "replay_step_refs": [],
      "artifact_refs": [],
      "judge_refs": [],
      "metric_refs": [],
      "evidence_tier": "hosted_structured",
      "severity": "blocking"
    }
  ],
  "clusters": [
    {
      "failure_cluster_key": "frc_...",
      "representative_failure_fingerprint": "frf_...",
      "count": 2,
      "promotable_count": 1,
      "severity": "blocking",
      "failure_state": "failed",
      "failure_class": "policy_violation",
      "evidence_tier": "hosted_structured",
      "challenge_keys": ["<challenge_key>"],
      "case_keys": ["<case_key>"],
      "run_agent_ids": ["<RUN_AGENT_ID>"],
      "headline": "<headline>",
      "recommended_action": "<recommended action>"
    }
  ],
  "next_cursor": "<cursor>"
}
```

Promote only when `promotable` is true and the chosen `promotion_mode` appears in `promotion_mode_available`.

## Manage Suites
List and inspect suites:

```bash
agentclash regression-suite list --json
agentclash regression-suite get <SUITE_ID> --json
agentclash regression-suite cases <SUITE_ID> --json
```

`regression-suite` also has the alias `regression-suites`.

Create a suite:

```bash
agentclash regression-suite create \
  --source-challenge-pack-id <CHALLENGE_PACK_ID> \
  --name "Checkout regressions" \
  --description "Failures promoted from checkout evals" \
  --default-gate-severity warning \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "source_challenge_pack_id": "<CHALLENGE_PACK_ID>",
  "name": "Checkout regressions",
  "description": "Failures promoted from checkout evals",
  "default_gate_severity": "warning"
}
```

Exact suite create rules:

- `source_challenge_pack_id` is required and must identify a challenge pack visible to the workspace.
- `name` is required.
- `default_gate_severity` is optional and defaults to `warning`.
- Allowed severities are `info`, `warning`, and `blocking`.
- New suites are created with `status: "active"` and `source_mode: "derived_only"`.

Update a suite:

```bash
agentclash regression-suite update <SUITE_ID> \
  --name "Checkout regressions" \
  --description "Current production blockers" \
  --status active \
  --default-gate-severity blocking \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "name": "Checkout regressions",
  "description": "Current production blockers",
  "status": "active",
  "default_gate_severity": "blocking"
}
```

Exact suite update rules:

- At least one field must be provided.
- `status` must be `active` or `archived`.
- `default_gate_severity` must be `info`, `warning`, or `blocking`.
- Archived suites cannot accept new promotions.

Suite JSON includes:

```json
{
  "id": "<SUITE_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "source_challenge_pack_id": "<CHALLENGE_PACK_ID>",
  "name": "Checkout regressions",
  "description": "Current production blockers",
  "status": "active",
  "source_mode": "derived_only",
  "default_gate_severity": "blocking",
  "case_count": 3,
  "created_by_user_id": "<USER_ID>",
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

`regression-suite list --json` prints `{ "items": [...] }` from the CLI. It does not expose the API's `total`, `limit`, or `offset` fields today.

## Promote Failures
The promotion command shape is:

```bash
agentclash run promote-failure <RUN_ID> <CHALLENGE_IDENTITY_ID> \
  --run-agent <RUN_AGENT_ID> \
  --suite <SUITE_ID> \
  --promotion-mode full_executable \
  --title "Policy answer must refuse credential disclosure" \
  --failure-summary "Agent disclosed a credential-like value instead of refusing." \
  --severity blocking \
  --json
```

Important exact details:

- The second positional argument is `challenge_identity_id` from `run failures --json`, not `failure_fingerprint` or `failure_cluster_key`.
- Pass `--run-agent` when the same challenge identity failed for multiple run agents; otherwise the backend returns `failure_review_item_ambiguous`.
- `--suite`, `--promotion-mode`, and `--title` map to required JSON fields.
- `--promotion-mode` should be `full_executable` or `output_only`, and it must be present in the failure item's `promotion_mode_available`.
- `--severity` is optional. If omitted, `policy_violation` and `sandbox_failure` default to `blocking`; other failure classes default to `warning`.
- The CLI has no `--status`, `--validator-overrides`, or `--metadata` flags for promotion. Use `--from-file` for those fields.

Full `--from-file` payload:

```json
{
  "run_agent_id": "<RUN_AGENT_ID>",
  "suite_id": "<SUITE_ID>",
  "promotion_mode": "full_executable",
  "title": "Policy answer must refuse credential disclosure",
  "failure_summary": "Agent disclosed a credential-like value instead of refusing.",
  "status": "proposed",
  "severity": "blocking",
  "validator_overrides": {
    "judge_threshold_overrides": {
      "policy_refusal": 0.9
    },
    "assertion_toggles": {
      "must_refuse": true
    }
  },
  "metadata": {
    "source": "triage",
    "source_challenge_key": "<challenge_key>",
    "source_failure_fingerprint": "frf_...",
    "source_failure_cluster_key": "frc_..."
  }
}
```

Exact promotion rules:

- `suite_id` is required.
- `title` is required.
- `status`, when provided, must be `active` or `proposed`.
- `severity`, when provided, must be `info`, `warning`, or `blocking`.
- `validator_overrides` may contain only `judge_threshold_overrides` and `assertion_toggles`.
- `metadata` must be a JSON object or null.
- If you want `source_challenge_key`, `source_failure_fingerprint`, or `source_failure_cluster_key` on the case response for duplicate checks, include those exact keys in `metadata`.
- The target suite must be active and must have the same `source_challenge_pack_id` as the run source pack.
- The failure item must be promotable. Items without a challenge input set or with insufficient reproduction context may have no available promotion modes.

`run promote-failure --json` prints the regression case object directly. The HTTP status is 201 when a case is created and 200 when the same suite, run agent, and challenge identity already map to an existing case; the CLI JSON output is the case in both paths.

## Review and Edit Cases
List cases in a suite:

```bash
agentclash regression-suite cases <SUITE_ID> --json
```

Update a case:

```bash
agentclash regression-suite case update <CASE_ID> \
  --title "Policy answer must refuse credential disclosure" \
  --description "Covers credential disclosure requests in support chat." \
  --status active \
  --severity blocking \
  --json
```

Equivalent `--from-file` payload:

```json
{
  "title": "Policy answer must refuse credential disclosure",
  "description": "Covers credential disclosure requests in support chat.",
  "status": "active",
  "severity": "blocking"
}
```

Exact case update rules:

- At least one field must be provided.
- `status` must be `proposed`, `active`, `muted`, `archived`, or `rejected`.
- `severity` must be `info`, `warning`, or `blocking`.
- There is no CLI command today to create a regression case directly, fetch a single case directly, or patch `expected_contract`, `payload_snapshot`, `validator_overrides`, or `metadata` after promotion.

Case JSON includes:

```json
{
  "id": "<CASE_ID>",
  "suite_id": "<SUITE_ID>",
  "workspace_id": "<WORKSPACE_ID>",
  "title": "Policy answer must refuse credential disclosure",
  "description": "Covers credential disclosure requests in support chat.",
  "status": "active",
  "severity": "blocking",
  "promotion_mode": "full_executable",
  "source_run_id": "<RUN_ID>",
  "source_run_agent_id": "<RUN_AGENT_ID>",
  "source_replay_id": "<REPLAY_ID>",
  "source_challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "source_challenge_input_set_id": "<INPUT_SET_ID>",
  "source_challenge_identity_id": "<CHALLENGE_IDENTITY_ID>",
  "source_challenge_key": "<challenge_key>",
  "source_case_key": "<case_key>",
  "source_item_key": "<item_key>",
  "source_failure_fingerprint": "frf_...",
  "source_failure_cluster_key": "frc_...",
  "evidence_tier": "hosted_structured",
  "failure_class": "policy_violation",
  "failure_summary": "<summary>",
  "payload_snapshot": {},
  "expected_contract": {},
  "validator_overrides": {},
  "metadata": {},
  "latest_promotion": {
    "id": "<PROMOTION_ID>",
    "workspace_regression_case_id": "<CASE_ID>",
    "source_run_id": "<RUN_ID>",
    "source_run_agent_id": "<RUN_AGENT_ID>",
    "source_event_refs": [],
    "promoted_by_user_id": "<USER_ID>",
    "promotion_reason": "<summary>",
    "promotion_snapshot": {},
    "created_at": "<timestamp>"
  },
  "validation": {
    "status": "not_validated",
    "run_count": 0,
    "failure_count": 0,
    "pass_count": 0,
    "reproduction_threshold": 0.6,
    "required_runs": 5,
    "remaining_runs": 5,
    "recommended_action": "<action>"
  },
  "created_at": "<timestamp>",
  "updated_at": "<timestamp>"
}
```

Validation status values are `not_validated`, `collecting_signal`, `reproducing`, `passing`, and `flaky`.

## Duplicate and Quality Checks
Before promotion:

- Compare the target suite's existing cases by `source_case_key`, `status`, and any available `source_failure_cluster_key`, `source_failure_fingerprint`, or `source_challenge_key` fields.
- Prefer updating or reusing an existing active/proposed case when a failure is the same behavior, even if it came from a different run.
- Promote only failures with concrete replay, judge, validator, metric, or artifact evidence. Avoid promoting `insufficient_evidence` unless the goal is explicitly to track missing evidence.
- Use `full_executable` when the failure has a challenge input set and enough structured evidence to replay the case. Use `output_only` when only the final output contract can be captured.

Backend duplicate protection is intentionally narrow: the same suite, run agent, and challenge identity returns the existing case. Cross-run duplicates and cross-suite duplicates are reviewer decisions.

## Verify Suite-Only
Use `eval start` when selectors can be names, slugs, or exact suite names:

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG_OR_EXACT_NAME> \
  --pack-version <VERSION_ID_OR_VERSION_NUMBER> \
  --deployment <DEPLOYMENT_ID_OR_EXACT_NAME> \
  --scope suite_only \
  --suite <SUITE_ID_OR_EXACT_NAME> \
  --follow
```

Use `run create` when automation already has IDs:

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --scope suite_only \
  --suite <SUITE_ID> \
  --case <CASE_ID> \
  --follow
```

Exact suite-only notes:

- `--scope suite_only` requires at least one `--suite` or `--case`.
- In `eval start`, `--suite` can resolve a suite ID or exact suite name; `--case` is a case ID.
- In `run create`, `--suite` and `--case` are ID-first.
- `--repetitions >= 2` does not support `--scope suite_only`, `--suite`, or `--case`.
- After the run, inspect `agentclash run get <RUN_ID> --json` for `regression_coverage`.

`regression_coverage` contains:

```json
{
  "regression_coverage": {
    "suites": [
      {
        "id": "<SUITE_ID>",
        "name": "Checkout regressions",
        "case_count": 3,
        "pass_count": 2,
        "fail_count": 1
      }
    ],
    "unmatched_cases": [
      {
        "id": "<CASE_ID>",
        "title": "<case title>",
        "outcome": "fail"
      }
    ]
  }
}
```

Then inspect:

```bash
agentclash run failures <VERIFICATION_RUN_ID> --json
agentclash eval scorecard <VERIFICATION_RUN_ID> --agent <RUN_AGENT_ID_OR_LABEL> --json
agentclash run ranking <VERIFICATION_RUN_ID> --json
```

## Expected Output
- A small set of promoted cases with clear source evidence, status, severity, suite, and promotion mode.
- No duplicate active/proposed cases for the same behavior in the target suite.
- A suite-only verification run ID and result.
- A concise explanation of whether the fix passes, fails, or needs more validation runs.

## Failure Modes
- Missing workspace: run `agentclash link`, `agentclash workspace use <id>`, pass `--workspace`, or set `AGENTCLASH_WORKSPACE`.
- `source_challenge_pack_id is required`: create the suite with the source challenge pack ID, not a challenge pack version ID.
- `challenge_pack_not_found`: the source challenge pack is not visible to the workspace.
- `regression_suite_name_conflict`: rename the suite or reuse the existing active suite.
- `regression_suite_archived`: reactivate the suite or pick an active one.
- `regression_suite_pack_mismatch`: choose a suite whose `source_challenge_pack_id` matches the run source pack.
- `failure_review_item_not_found`: use the `challenge_identity_id` from `run failures --json`, not the fingerprint or cluster key.
- `failure_review_item_ambiguous`: pass `--run-agent <RUN_AGENT_ID>`.
- `failure_not_promotable`: do not promote; collect better evidence or run with a challenge input set.
- `promotion_mode_unavailable`: choose a mode listed in `promotion_mode_available`.
- `invalid_promotion_overrides`: use only `judge_threshold_overrides` and `assertion_toggles` with the correct map value types.
- `--scope suite_only requires at least one --suite or --case`: add a suite or case selector.

## Safety Notes
- Promotion and suite/case updates mutate shared workspace state. Confirm intent before changing production suites.
- Do not put secrets, customer data, raw artifact contents, or long traces into case titles, summaries, descriptions, metadata, or chat.
- Prefer `status: "proposed"` when a reviewer still needs to approve the case.
- Archive or reject noisy cases instead of leaving weak regressions active.
- Keep suite-only verification focused; avoid broad full-pack reruns when a targeted suite is enough.

## Report Back Format
```text
Run: <RUN_ID>
Failure reviewed:
- challenge_identity_id=<id> run_agent_id=<id> cluster=<frc_...> class=<failure_class> severity=<severity>
Suite: <SUITE_ID> (<name>)
Duplicate check: <none found | reused CASE_ID | updated CASE_ID>
Promotion:
- case=<CASE_ID> mode=<full_executable|output_only> status=<proposed|active> severity=<severity>
Case edits: <none | title/description/status/severity changes>
Verification:
- command=<exact suite-only command>
- run=<VERIFICATION_RUN_ID>
- regression_coverage=<pass/fail counts or unavailable>
Next action: <ship/fix/rerun/needs-review>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
- `agentclash-regression-flywheel`
- `agentclash-ci-release-gate`

## Related Docs
- `/docs-md/concepts/replay-and-scorecards`
- `/docs-md/concepts/runs-and-evals`
- `/docs-md/reference/cli`
````

---

# CI Release Gate Skill

Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.

Source: https://www.agentclash.dev/docs/agent-skills/agentclash-ci-release-gate
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agentclash-ci-release-gate

Canonical source: `web/content/agent-skills/agentclash-ci-release-gate/SKILL.md`

Markdown export: `/docs-md/agent-skills/agentclash-ci-release-gate`

## Use This Skill When

Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.

## Full SKILL.md

````markdown
---
name: agentclash-ci-release-gate
description: Use when wiring AgentClash manifest-based CI gates, deciding whether a PR should run AgentClash, resolving baselines, running `agentclash ci run`, interpreting gate exit codes, collecting CI artifacts, or configuring regression promotion policy in GitHub Actions.
metadata:
  agentclash.role: ci
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash CI Release Gate

## Purpose
Wire an AgentClash candidate build/deployment into a repo-tracked CI manifest, compare it against a baseline run, and turn the release gate verdict into CI status.

## Use When
- A pull request or mainline workflow should run AgentClash only when relevant files or labels match.
- A user needs a `.agentclash/ci.yaml` manifest for a candidate agent, workload, baseline, gate, and regression promotion policy.
- CI should fail on AgentClash release gate failures, warnings, insufficient evidence, setup errors, run timeouts, or candidate run failures.
- A GitHub Actions workflow should use the repo-local `.github/actions/agentclash-ci` composite action.

## Do Not Use When
- The agent build spec does not exist yet; use `agentclash-agent-build-author`.
- Runtime profiles, provider accounts, model aliases, or workspace tools are not configured; use `agentclash-runtime-resources-setup`.
- The challenge pack or input set is not validated and published; use the challenge-pack skills first.
- The task is only to read scorecards or failure evidence from an existing run; use `agentclash-scorecard-reader` or `agentclash-regression-flywheel`.

## Environment
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
export AGENTCLASH_TOKEN="<token>"
export AGENTCLASH_WORKSPACE="<workspace-id>"
```

In GitHub Actions, pass tokens through secrets or the composite action inputs. Do not print tokens. `ci run` exits with code `10` and a JSON error envelope when workspace context is missing, so set `AGENTCLASH_WORKSPACE`, pass `--workspace`, or use saved CLI workspace config.

## Manifest Shape
Create a sample with:

```bash
agentclash ci init .agentclash/ci.yaml
agentclash ci init .agentclash/ci.yaml --force
```

The CLI sample manifest is:

```yaml
version: 1
trigger:
  paths:
    - .agentclash/agent.json
    - prompts/**
    - tools/**
  labels:
    - agentclash/eval
candidate:
  build:
    agent_build_id: 00000000-0000-0000-0000-000000000001
    spec_file: .agentclash/agent.json
  deployment:
    name: pr-candidate
    runtime_profile_id: 00000000-0000-0000-0000-000000000002
    provider_account_id: 00000000-0000-0000-0000-000000000003
    model_alias_id: 00000000-0000-0000-0000-000000000004
evaluation:
  challenge_pack_version_id: 00000000-0000-0000-0000-000000000005
  input_set_id: 00000000-0000-0000-0000-000000000006
  regression_suites:
    - 00000000-0000-0000-0000-000000000007
baseline:
  run_id: 00000000-0000-0000-0000-000000000008
  refresh: manual
  max_age_days: 30
gate:
  fail_on: regression
regressions:
  promote_failures: proposed
```

Exact fields the current CLI parses:

- `version`: must be `1`.
- `trigger.paths`: required, nonblank doublestar globs; `trigger.labels`: optional labels that can force a run.
- `candidate.build.agent_build_id`: required existing agent build ID.
- `candidate.build.spec_file`: required relative path inside the repository; absolute paths and `..` escapes are rejected by `ci run`.
- `candidate.deployment.name`: optional deployment name; if blank, `ci run` generates `agentclash-ci-<unix>`.
- `candidate.deployment.runtime_profile_id`: required.
- `candidate.deployment.provider_account_id` and `candidate.deployment.model_alias_id`: optional; remote validation checks both exist and rejects a model alias whose `provider_account_id` conflicts with the provider account field.
- `evaluation.challenge_pack_version_id`: required.
- `evaluation.input_set_id`, `evaluation.regression_suites`, and `evaluation.regression_cases`: optional; blank regression entries are invalid.
- `baseline.run_id`: locked baseline run, preferred for PR gates.
- `baseline.run_agent_id`: optional but only valid with `baseline.run_id`.
- `baseline.deployment_id`: moving baseline selector; mutually exclusive with `baseline.run_id`.
- `baseline.refresh`: optional `manual`, `propose`, or `auto_on_main`; default behavior is `manual`.
- `baseline.max_age_days`: optional non-negative freshness limit; `0` means no age check.
- `gate.fail_on`: required, one of `regression`, `warning`, or `insufficient_evidence`.
- `gate.policy_file`: optional schema field, but the current `agentclash ci run` implementation does not read or post it.
- `regressions.promote_failures`: required, one of `disabled`, `proposed`, or `auto_on_main`.

Important gate fidelity note: today `ci run` posts `baseline_run_id`, `candidate_run_id`, and optional run-agent IDs to `/v1/release-gates/evaluate`. It does not pass `gate.fail_on`, does not load `gate.policy_file`, and has no `--fail-on` or `--policy-file` flag. The backend normalizes an empty policy to the default release gate policy. Do not claim manifest gate fields customize evaluation until the CLI source wires that behavior.

## Validation Commands
Validate locally first:

```bash
agentclash ci validate .agentclash/ci.yaml
agentclash ci validate .agentclash/ci.yaml --json
```

Use remote validation when workspace credentials are available:

```bash
agentclash ci validate .agentclash/ci.yaml --remote --json
```

Structured `validate` output includes:

```json
{
  "path": ".agentclash/ci.yaml",
  "valid": true,
  "manifest": {},
  "remote": {
    "workspace_id": "<WORKSPACE_ID>",
    "valid": true,
    "checks": [
      {
        "field": "candidate.build.agent_build_id",
        "resource": "agent_build",
        "id": "<AGENT_BUILD_ID>",
        "valid": true,
        "code": "ok",
        "message": "agent build is accessible in the selected workspace"
      }
    ]
  }
}
```

Remote validation checks the agent build, runtime profile, provider account, model alias, challenge pack version, input set, regression suites, regression cases, and baseline compatibility. API failures that cannot be reduced to a field problem appear as `remote API error: ...`.

## Should Run
Use this before spending hosted evaluation budget:

```bash
agentclash ci should-run --manifest .agentclash/ci.yaml --base origin/main --head HEAD --json
agentclash ci should-run --manifest .agentclash/ci.yaml --changed-file prompts/refund.md --labels agentclash/eval --json
```

`--changed-file` may be repeated. `--labels` accepts comma-separated or repeated values. If changed files are omitted and refs are present, the CLI derives files with `git diff --name-only --diff-filter=ACDMRTUXB <base>...<head>`. Ref defaults can come from `AGENTCLASH_CI_BASE`, `GITHUB_BASE_REF` as `origin/<base>`, `AGENTCLASH_CI_HEAD`, `GITHUB_SHA`, and `HEAD` when a base is set.

JSON output shape:

```json
{
  "path": ".agentclash/ci.yaml",
  "should_run": true,
  "reason": "changed files matched trigger.paths",
  "changed_files": ["prompts/refund.md"],
  "labels": ["agentclash/eval"],
  "checked_path_globs": ["prompts/**"],
  "checked_labels": ["agentclash/eval"],
  "matched_paths": [{"pattern": "prompts/**", "file": "prompts/refund.md"}],
  "matched_labels": ["agentclash/eval"]
}
```

The decision is an OR: matched paths or matched labels make `should_run: true`. Reasons are exactly:

- `changed files matched trigger.paths and labels matched trigger.labels`
- `changed files matched trigger.paths`
- `labels matched trigger.labels`
- `no changed files or labels were provided`
- `no changed files or labels matched manifest triggers`

## Baseline Resolution
Resolve and print the exact baseline before running the gate:

```bash
agentclash ci baseline --manifest .agentclash/ci.yaml --json
```

For `baseline.run_id`, strategy is `locked_run` and source is `baseline.run_id`. The run must be in the selected workspace, completed, compatible with `evaluation.challenge_pack_version_id` and optional `evaluation.input_set_id`, and within `baseline.max_age_days` when set. `baseline.run_agent_id` is resolved against that run when present.

For `baseline.deployment_id`, strategy is `deployment_latest_completed` and source is `baseline.deployment_id`. The CLI selects the newest completed compatible run whose participant used that deployment and warns that deployment baselines move over time. Prefer `baseline.run_id` for PRs.

Refresh next actions are:

- `manual`: after a successful mainline run, update `baseline.run_id` intentionally in a reviewed change.
- `propose`: after a successful mainline run, open a reviewed change that updates `baseline.run_id`.
- `auto_on_main`: after a successful protected mainline run, automation may update `baseline.run_id` with an auditable commit.

## Run The Gate
Run the manifest workflow:

```bash
agentclash ci run --manifest .agentclash/ci.yaml --json --artifact-dir agentclash-artifacts
agentclash ci run --manifest .agentclash/ci.yaml --json --summary-file agentclash-summary.md
agentclash ci run --manifest .agentclash/ci.yaml --follow --timeout 30m --poll-interval 5s
```

Flags:

- `--manifest`: defaults to `.agentclash/ci.yaml`.
- `--follow`: streams run events only for non-JSON output.
- `--timeout`: duration, default `30m`; `0` disables timeout; negative values exit `10`.
- `--poll-interval`: duration, default `5s`; must be greater than zero.
- `--summary-file`: writes a Markdown gate summary.
- `--github-step-summary`: defaults true and appends when `GITHUB_STEP_SUMMARY` is set.
- `--artifact-dir`: writes stable JSON artifacts.
- CI metadata overrides: `--ci-provider`, `--ci-repository`, `--ci-pull-request`, `--ci-branch`, `--ci-ref`, `--ci-commit`, `--ci-workflow`, `--ci-workflow-run-id`, `--ci-workflow-run-attempt`, `--ci-workflow-run-url`, `--ci-event`, `--ci-default-branch`.

`ci run` does this in order: validate local manifest, remote-validate resource IDs, create a build version from `candidate.build.spec_file`, mark it ready, create a deployment, resolve the baseline, create a run with `official_pack_mode: "full"` plus optional regression suites/cases, wait for completion, resolve the candidate run agent, optionally fetch scorecard and comparison when reports are enabled, evaluate the release gate, optionally promote regression failures, then write reports.

Structured output includes:

```json
{
  "manifest_path": ".agentclash/ci.yaml",
  "workspace_id": "<WORKSPACE_ID>",
  "remote_validation": {},
  "candidate": {
    "agent_build_id": "<AGENT_BUILD_ID>",
    "build_version_id": "<BUILD_VERSION_ID>",
    "deployment_id": "<DEPLOYMENT_ID>",
    "run_id": "<RUN_ID>",
    "run_agent_id": "<RUN_AGENT_ID>",
    "run_status": "completed",
    "run_url": "<URL>",
    "deployment_name": "pr-candidate",
    "ci_metadata": {}
  },
  "baseline_resolution": {},
  "baseline": {
    "run_id": "<BASELINE_RUN_ID>",
    "run_agent_id": "<BASELINE_RUN_AGENT_ID>",
    "status": "completed"
  },
  "release_gate": {},
  "gate_verdict": "pass",
  "failure_reason": "",
  "reports": {},
  "regression_promotions": {},
  "exit_code": 0
}
```

Exit codes are exact:

- `0`: pass.
- `1`: release gate failed.
- `2`: release gate warning.
- `3`: insufficient gate evidence.
- `10`: invalid manifest, missing workspace for `ci run`, invalid duration flag, invalid CI metadata flag, or local candidate spec error.
- `20`: API/auth failure or report-writing failure after a successful gate.
- `30`: candidate run timed out.
- `31`: candidate run failed before gate evaluation.

Successful terminal run statuses are `completed`, `succeeded`, and `success`. Failed terminal statuses include `failed`, `error`, `errored`, `canceled`, `cancelled`, `aborted`, `timed_out`, `timeout`, and `expired`.

## Reports And Artifacts
Reports are enabled only when `--summary-file`, `GITHUB_STEP_SUMMARY` with `--github-step-summary`, or `--artifact-dir` is set. `--artifact-dir agentclash-artifacts` writes:

- `agentclash-artifacts/run.json` with kind `agentclash.ci.run`.
- `agentclash-artifacts/scorecard.json` with kind `agentclash.ci.scorecard`.
- `agentclash-artifacts/comparison.json` with kind `agentclash.ci.comparison`.
- `agentclash-artifacts/gate.json` with kind `agentclash.ci.gate`.
- `agentclash-artifacts/result.json` with kind `agentclash.ci.result`.

Each artifact is wrapped in an envelope with `schema_version: "2026-05-04"`, `kind`, `generated_at`, `manifest_path`, `workspace_id`, `challenge_pack_version_id`, `candidate`, `baseline`, optional gate policy identity fields, and `payload`.

## Regression Promotion
`regressions.promote_failures` runs only when the gate verdict is `fail`. Modes:

- `disabled`: returns a skipped summary with reason `policy_disabled`.
- `proposed`: creates proposed regression candidates.
- `auto_on_main`: creates active cases only on the default branch outside pull request events; otherwise it blocks with `pull_request_event`, `missing_default_branch`, or `non_default_branch`.

Promotion also blocks when `evaluation.regression_suites` is empty, using reason `no_regression_suites`. It lists candidate run failures with `limit=200`, prefers `full_executable` over `output_only`, skips non-promotable or unsupported failures, and avoids existing cases by `source_challenge_identity_id` or metadata `source_failure_cluster_key` unless the existing case status is `archived` or `rejected`.

`regression_promotions` contains `policy`, optional `case_status`, `created`, `existing`, `skipped`, `blocked`, and `errors`. Created/existing items include `suite_id`, `case_id`, `challenge_identity_id`, `challenge_key`, `failure_cluster_key`, `status`, and `created`.

## GitHub Action
The repo-local composite action is `.github/actions/agentclash-ci`. Example:

```yaml
name: AgentClash gate
on:
  pull_request:

jobs:
  agentclash:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - id: agentclash
        uses: agentclash/agentclash/.github/actions/agentclash-ci@main
        with:
          token: ${{ secrets.AGENTCLASH_TOKEN }}
          workspace: ${{ secrets.AGENTCLASH_WORKSPACE }}
          api-url: https://api.agentclash.dev
          manifest: .agentclash/ci.yaml
          artifact-dir: agentclash-artifacts
      - uses: actions/upload-artifact@v4
        if: always() && steps.agentclash.outputs['should-run'] == 'true'
        with:
          name: agentclash-ci
          path: |
            ${{ steps.agentclash.outputs['result-file'] }}
            ${{ steps.agentclash.outputs['artifact-dir'] }}/*.json
```

Action inputs are exactly `manifest`, `token`, `workspace`, `api-url`, `install-cli`, `cli-version`, `remote-validate`, `skip-if-unmatched`, `base`, `head`, `changed-files`, `labels`, `artifact-dir`, `result-file`, `timeout`, `poll-interval`, `follow`, and `default-branch`.

Action outputs are exactly `should-run`, `skip-reason`, `run-id`, `gate-verdict`, `exit-code`, `result-file`, and `artifact-dir`.

The action exports `AGENTCLASH_TOKEN`, `AGENTCLASH_WORKSPACE`, and `AGENTCLASH_API_URL`, installs `agentclash@<cli-version>` when `install-cli` is true, runs `ci validate` with `--remote` by default, runs `ci should-run --json`, exits `0` when unmatched and `skip-if-unmatched` is true, then runs `ci run --json --artifact-dir`.

## Failure Modes
- Missing token or API access: confirm `AGENTCLASH_TOKEN` and hosted API URL.
- Missing workspace: set `AGENTCLASH_WORKSPACE`, pass `--workspace`, or configure workspace locally.
- Local validation fails: fix YAML fields before remote calls; unknown YAML fields fail because the decoder uses known fields.
- Remote validation fails: check workspace visibility for build, runtime profile, provider account, model alias, challenge pack version, input set, regression suites/cases, and baseline.
- `should-run` skips unexpectedly: inspect `trigger.paths`, `trigger.labels`, checkout depth, base/head refs, and explicitly passed `changed-files`.
- Candidate spec fails: `candidate.build.spec_file` must be readable JSON at a relative path inside the repo.
- Candidate run is slow: tune `--timeout` and `--poll-interval`.
- `auto_on_main` is blocked: pass default-branch metadata and run from the default branch, not a PR event.

## Report Back Format
```text
Manifest: <path>
Should run: <true/false + reason>
Baseline: <run-id or deployment selector + strategy>
Candidate run: <run-id>
Gate verdict: <pass|warn|fail|insufficient_evidence>
Exit code: <code + meaning>
Artifacts: <result-file and artifact-dir>
Regression candidates: <created/existing/skipped/blocked/errors>
Next command: <exact agentclash command or GitHub Actions fix>
```

## Related Skills
- `agentclash-cli-setup`: authenticate, select workspace, and configure hosted API.
- `agentclash-runtime-resources-setup`: create runtime profiles, provider accounts, model aliases, secrets, and tools.
- `agentclash-agent-build-author`: create the candidate build spec and ready build version.
- `agentclash-agent-deployment-setup`: understand deployment resources used by the CI manifest.
- `agentclash-challenge-pack-validation-publish`: publish the challenge pack version and optional input sets.
- `agentclash-eval-runner`: run ad hoc evals before formal CI gates.
- `agentclash-scorecard-reader`: inspect scorecard, comparison, replay, and failure evidence.
- `agentclash-regression-flywheel`: promote failures and manage regression suites/cases.

## Related Docs
- `/docs-md/guides/ci-cd-agent-gates`
- `/docs-md/guides/ci-cd-workload-recipes`
- `/docs-md/challenge-packs/eval-workflows-and-gates`
````

---

# CLI Reference

Commands, flags, and command groups generated from the current Cobra CLI source.

Source: https://www.agentclash.dev/docs/reference/cli
Markdown export: https://www.agentclash.dev/docs-md/reference/cli

Intro-oriented readers should still start from [Hosted quickstart](../getting-started/quickstart); this page focuses on mechanically listing what `cli/cmd` exposes after each docs rebuild. For repo layout around the CLI, see [Codebase tour](../contributing/codebase-tour).

This page is generated from the Cobra command definitions in `cli/cmd`.

## Global flags

- `--api-url` — API base URL (overrides config)
- `--json` — Output in JSON format
- `--no-color` — Disable color output
- `--output` (`-o`) — Output format: table, json, yaml
- `--quiet` (`-q`) — Suppress non-essential output
- `--verbose` (`-v`) — Enable debug output on stderr
- `--workspace` (`-w`) — Workspace ID (overrides config)
- `--yes` — Skip confirmation prompts

## Command groups

### `agent-harness`

Manage coding-agent harnesses

#### `create`

Create an agent harness

Flags

- `--api-key-secret` — Workspace secret name containing the runner provider API key
- `--auth-mode` default: `api_key_secret` — Harness auth mode: api_key_secret
- `--base-branch` — Base branch for repository work
- `--codex-model` — Runner model override
- `--codex-template` — E2B template override for the harness runner
- `--description` — Harness description
- `--evaluation-config` — Inline JSON evaluation config
- `--evaluation-config-file` — JSON file with validators and LLM judges
- `--execution-config` — Inline JSON execution config
- `--from-file` — JSON file with agent harness spec
- `--harness-kind` default: `codex_e2b` — Harness runner kind: codex_e2b or claude_e2b
- `--name` — Harness name
- `--openai-api-key-secret` — Workspace secret name containing OPENAI_API_KEY
- `--repository-url` — Repository URL for the harness task
- `--task` — Task prompt for the coding harness

#### `execution`

Inspect agent harness executions

##### `cancel <execution-id>`

Cancel an Agent Harness execution

##### `failure-review`

Inspect or edit Agent Harness failure classifications

###### `get <execution-id>`

Get Agent Harness failure review

###### `update <execution-id>`

Update Agent Harness failure review annotations

Flags

- `--from-file` — JSON file with failure review update payload
- `--human-class` — Human-curated failure class
- `--human-payload` — Inline JSON human payload
- `--human-summary` — Human-curated failure summary
- `--suggested-class` — Suggested failure class
- `--suggested-confidence` — Suggested confidence as a decimal
- `--suggested-payload` — Inline JSON suggested payload
- `--suggested-source` — Suggested source: rules or llm
- `--suggested-summary` — Suggested failure summary

##### `get <execution-id>`

Get an agent harness execution

##### `promote-task <execution-id>`

Promote a prior harness run into a private suite task

Flags

- `--failure-class` — Failure class to store with promotion metadata
- `--failure-summary` — Failure summary to store with promotion metadata
- `--from-file` — JSON file with promotion payload
- `--metadata` — Inline JSON promotion metadata
- `--public-prompt` — Sanitized public prompt
- `--suite` — Target Agent Harness suite ID
- `--title` — Promoted private task title

##### `retry <execution-id>`

Retry a terminal Agent Harness execution

Flags

- `--idempotency-key` — Retry idempotency key

#### `executions <harness-id>`

List executions for an agent harness

#### `failures`

Inspect Agent Harness failure summaries

##### `summary`

Summarize Agent Harness failure modes

#### `get <id>`

Get an agent harness

#### `list`

List agent harnesses

#### `run <harness-id>`

Start an agent harness execution

Flags

- `--follow` — Poll until the harness execution reaches a terminal status
- `--message` — Override the harness task prompt for this execution
- `--poll-interval` — Polling interval for --follow

#### `suite`

Manage Agent Harness suites and private task banks

##### `create`

Create an Agent Harness suite/private task bank

Flags

- `--description` — Suite description
- `--from-file` — JSON file with agent harness suite spec
- `--metadata` — Inline JSON suite metadata
- `--name` — Suite name
- `--task-json` — Suite task JSON object; may be repeated

##### `list`

List Agent Harness suites

##### `rankings <suite-id>`

Get Agent Harness suite rankings

Flags

- `--k` — k value for pass@k and pass^k
- `--version-id` — Immutable suite version ID

##### `run <suite-id>`

Start suite runs across one or more harnesses

Flags

- `--harness` — Harness ID to run; may be repeated or comma-separated
- `--task` — Suite task ID filter; may be repeated or comma-separated

##### `tasks <suite-id>`

List public tasks for an Agent Harness suite

### `artifact`

Upload and download artifacts

#### `download <artifactId>`

Download an artifact

Flags

- `--output` (`-O`) — Output file path (defaults to stdout)

#### `list`

List artifacts in the workspace

#### `upload <file>`

Upload an artifact

Flags

- `--metadata` — JSON metadata (optional)
- `--run` — Run ID (optional)
- `--run-agent` — Run agent ID (optional)
- `--type` (required) — Artifact type (required)

### `auth`

Manage authentication

#### `login`

Log in to AgentClash

Flags

- `--device` — Print the verification URL instead of opening the browser automatically
- `--force` — Start a new browser login even if existing credentials are valid

#### `logout`

Log out and remove stored credentials

#### `status`

Show current authentication status

#### `tokens`

Manage CLI access tokens

##### `list`

List your CLI tokens

##### `revoke <token-id>`

Revoke a CLI token

### `baseline`

Manage the workspace-scoped default baseline run

#### `clear`

Clear the default baseline run for the current workspace

#### `set [run]`

Bookmark a run as the default baseline for the current workspace

Flags

- `--agent` — Run agent ID or label (optional)

#### `show`

Show the default baseline run for the current workspace

### `build`

Manage agent builds

#### `create`

Create a new agent build

Flags

- `--description` — Build description
- `--name` (required) — Build name (required)

#### `get <id>`

Get agent build with version history

#### `list`

List agent builds

#### `version`

Manage agent build versions

##### `create <buildId>`

Create a new draft version

Flags

- `--agent-kind` — Agent kind: llm_agent, workflow_agent, programmatic_agent, multi_agent_system, hosted_external
- `--spec-file` — JSON file with version spec fields

##### `get <versionId>`

Get a build version

##### `ready <versionId>`

Mark a version as ready (immutable, deployable)

##### `update <versionId>`

Update a draft build version

Flags

- `--spec-file` — JSON file with updated version spec fields

##### `validate <versionId>`

Validate a build version

### `challenge-pack`

Manage challenge packs

#### `init <file>`

Scaffold a minimal challenge pack YAML bundle

Flags

- `--force` — Overwrite an existing file
- `--name` — Challenge pack display name (defaults from the file name)
- `--slug` — Challenge pack slug (defaults from the file name)
- `--template` default: `prompt_eval` — Starter template: prompt_eval or native

#### `list`

List challenge packs

#### `publish <file>`

Publish a challenge pack YAML bundle

#### `validate <file>`

Validate a challenge pack YAML bundle

### `ci`

Manage AgentClash CI manifests

#### `baseline`

Resolve the baseline selected by an AgentClash CI manifest

Flags

- `--manifest` default: `.agentclash/ci.yaml` — Path to the AgentClash CI manifest

#### `init <file>`

Write a sample AgentClash CI manifest

Flags

- `--force` — Overwrite an existing manifest

#### `run`

Run the AgentClash CI workflow described by a manifest

Flags

- `--artifact-dir` — Write stable AgentClash CI JSON artifacts to this directory
- `--ci-branch` — Branch metadata override
- `--ci-commit` — Commit SHA metadata override
- `--ci-default-branch` — Default branch metadata override for auto_on_main regression promotion
- `--ci-event` — CI event name metadata override
- `--ci-provider` — CI provider metadata override
- `--ci-pull-request` — Positive pull request number metadata override
- `--ci-ref` — Git ref metadata override
- `--ci-repository` — Repository metadata override, for example owner/repo
- `--ci-workflow` — Workflow name metadata override
- `--ci-workflow-run-attempt` — Workflow run attempt metadata override
- `--ci-workflow-run-id` — Workflow run id metadata override
- `--ci-workflow-run-url` — Workflow run URL metadata override
- `--follow` — Stream run events while waiting for the candidate run
- `--github-step-summary` — Append a GitHub Actions step summary when GITHUB_STEP_SUMMARY is set
- `--manifest` default: `.agentclash/ci.yaml` — Path to the AgentClash CI manifest
- `--poll-interval` — Polling interval while waiting for run completion
- `--summary-file` — Write a Markdown CI gate summary to this file
- `--timeout` — Maximum time to wait for the candidate run; 0 disables the timeout

#### `should-run`

Decide whether AgentClash CI should run

Flags

- `--base` — Base git ref for deriving changed files
- `--changed-file` — Changed file path; may be repeated
- `--github-event` — GitHub event JSON file for deriving pull request labels
- `--head` — Head git ref for deriving changed files
- `--labels` — Pull request labels; may be comma-separated or repeated
- `--manifest` default: `.agentclash/ci.yaml` — Path to the AgentClash CI manifest
- `--repo` default: `.` — Git repository path for --base/--head diff

#### `validate <file>`

Validate an AgentClash CI manifest

Flags

- `--remote` — Validate manifest resource IDs against the selected workspace

### `compare`

Compare runs and evaluate release gates

#### `gate`

Evaluate a release gate (nonzero exit = regression or missing evidence)

Flags

- `--baseline` (required) — Baseline run ID (required)
- `--baseline-agent` — Baseline run agent ID (optional)
- `--candidate` (required) — Candidate run ID (required)
- `--candidate-agent` — Candidate run agent ID (optional)

#### `runs`

Compare baseline vs candidate runs

Flags

- `--baseline` (required) — Baseline run ID (required)
- `--baseline-agent` — Baseline run agent ID (optional)
- `--candidate` (required) — Candidate run ID (required)
- `--candidate-agent` — Candidate run agent ID (optional)

### `config`

Manage CLI configuration

#### `get <key>`

Get a config value

#### `list`

List all config values

#### `set <key> <value>`

Set a config value

### `deployment`

Manage agent deployments

#### `create`

Create an agent deployment

Flags

- `--agent-build-id` — Agent build ID
- `--build-version-id` — Agent build version ID
- `--from-file` — JSON file with deployment spec
- `--model-alias-id` — Model alias ID
- `--name` — Deployment name
- `--provider-account-id` — Provider account ID
- `--runtime-profile-id` — Runtime profile ID

#### `list`

List agent deployments

### `doctor`

Check auth, workspace, and eval readiness

### `eval`

Workflow-first eval commands

#### `scorecard [run]`

Show a run-first scorecard and compare against the bookmarked baseline

Flags

- `--agent` — Run agent ID or label (optional)

#### `start`

Start an eval using names, defaults, and guided selection

Flags

- `--case` — Regression case IDs (repeatable)
- `--deployment` — Deployment ID or exact name (repeatable)
- `--follow` — Follow run events after creation
- `--input-set` — Challenge input set ID, key, or exact name
- `--name` — Run name (optional)
- `--pack` — Challenge pack ID, slug, or exact name
- `--pack-version` — Challenge pack version ID or version number
- `--race-context` — Enable live peer-standings injection during the run (requires 2+ agents)
- `--race-context-cadence` — Override race-context cadence; minimum steps between standings injections, [1, 10]. 0 uses the backend default.
- `--repetitions` — Repeat the eval N times in a multi-run eval session, [1, 100]. >=2 routes through /v1/eval-sessions and unlocks pass@K + pass^K aggregation.
- `--scope` default: `full` — Run scope: full or suite_only
- `--suite` — Regression suite ID or exact name (repeatable)

### `infra`

Manage infrastructure resources

#### `model-catalog`

Browse the global model catalog

##### `get <id>`

Get a model catalog entry

##### `list`

List available models

### `init`

Initialize a project with .agentclash.yaml

Flags

- `--org-id` — Organization ID to bind
- `--workspace-id` — Workspace ID to bind

### `link [workspace]`

Choose and save your default workspace

### `org`

Manage organizations

#### `create`

Create a new organization

Flags

- `--name` (required) — Organization name (required)
- `--slug` — Organization slug (optional, auto-generated)

#### `get <id>`

Get organization details

#### `list`

List organizations you belong to

#### `members`

Manage organization members

##### `invite <orgId>`

Invite a member to the organization

Flags

- `--email` (required) — Email address to invite (required)
- `--role` default: `org_member` — Role: org_admin, org_member

##### `list <orgId>`

List organization members

##### `update <membershipId>`

Update an organization membership

Flags

- `--role` — New role: org_admin, org_member
- `--status` — New status: active, suspended, archived

#### `update <id>`

Update an organization

Flags

- `--name` — New organization name
- `--status` — New status (active, archived)

### `playground`

Manage playgrounds, test cases, and experiments

#### `create`

Create a playground

Flags

- `--from-file` — JSON file with playground spec
- `--name` — Playground name

#### `delete <id>`

Delete a playground

#### `experiment`

Manage playground experiments

##### `batch <playgroundId>`

Create experiments in batch (one per model)

Flags

- `--from-file` — JSON file with batch experiment spec

##### `compare`

Compare two experiments

Flags

- `--baseline` (required) — Baseline experiment ID (required)
- `--candidate` (required) — Candidate experiment ID (required)

##### `create <playgroundId>`

Create an experiment

Flags

- `--from-file` — JSON file with experiment spec

##### `get <experimentId>`

Get an experiment

##### `list <playgroundId>`

List experiments

##### `results <experimentId>`

List results for an experiment

#### `get <id>`

Get a playground

#### `list`

List playgrounds

#### `test-case`

Manage playground test cases

##### `create <playgroundId>`

Create a test case

Flags

- `--from-file` — JSON file with test case spec

##### `delete <testCaseId>`

Delete a test case

##### `list <playgroundId>`

List test cases

##### `update <testCaseId>`

Update a test case

Flags

- `--from-file` — JSON file with test case spec

#### `update <id>`

Update a playground

Flags

- `--from-file` — JSON file with playground spec

### `prompt-eval`

Manage prompt eval configs

#### `import-promptfoo <file>`

Convert a safe Promptfoo subset into an AgentClash prompt eval config

Flags

- `--force` — Overwrite --out when it already exists
- `--lossy` — Allow documented lossy conversions
- `--name` — Prompt eval name for the generated config
- `--out` — Write the converted prompt eval YAML to this path instead of stdout
- `--provider-account` default: `default` — Provider account name or id to use for imported provider aliases

#### `init [file]`

Scaffold a prompt eval YAML config

Flags

- `--force` — Overwrite an existing file
- `--name` — Prompt eval name (defaults from the file name)

#### `results <experiment-id>`

Fetch prompt eval experiment results

Flags

- `--threshold` — Override the assertion pass-rate gate for fetched results

#### `run [file]`

Compile a prompt eval config and launch playground experiments

Flags

- `--ci` — Apply CI-safe validation rules
- `--follow` — Wait for launched experiments and print results
- `--max-cases` — Maximum model x test cases allowed before launch
- `--poll-interval` — Polling interval while following experiments
- `--threshold` — Override thresholds.assertion_pass_rate for this run
- `--timeout` — Maximum time to wait while following experiments; 0 disables the timeout

#### `validate [file]`

Validate a prompt eval YAML config locally

Flags

- `--ci` — Apply CI-safe validation rules
- `--max-cases` — Maximum model x test cases allowed before launch
- `--remote` — Validate referenced AgentClash workspace resources without creating them

### `regression-suite`

Manage regression suites and cases

#### `case`

Manage individual regression cases

##### `capture-production <suiteId>`

Capture a production failure as a proposed regression case

Flags

- `--evidence-tier` — Evidence tier
- `--external-url` — Production incident URL
- `--failure-class` — Failure class
- `--failure-summary` — Failure summary
- `--from-file` — JSON file with production failure capture payload
- `--incident-id` — Production incident ID
- `--observed-at` — Production observation timestamp (RFC3339)
- `--promotion-mode` — Promotion mode: full_executable, output_only, or manual
- `--severity` — Case severity: info, warning, or blocking
- `--source` — Production source label
- `--source-case-key` — Source production case or incident key
- `--source-challenge-identity-id` — Source challenge identity ID
- `--source-challenge-input-set-id` — Source challenge input set ID
- `--source-challenge-pack-version-id` — Source challenge pack version ID
- `--source-item-key` — Source item key
- `--title` — Regression case title

##### `update <caseId>`

Update a regression case

Flags

- `--description` — Case description
- `--from-file` — JSON file with regression case patch payload
- `--severity` — Case severity: info, warning, or blocking
- `--status` — Case status: proposed, active, muted, archived, or rejected
- `--title` — Case title

#### `cases <suiteId>`

List regression cases in a suite

#### `create`

Create a regression suite

Flags

- `--default-gate-severity` — Default gate severity: info, warning, or blocking
- `--description` — Suite description
- `--from-file` — JSON file with regression suite create payload
- `--name` — Suite name
- `--source-challenge-pack-id` — Source challenge pack ID

#### `get <suiteId>`

Get a regression suite

#### `list`

List regression suites

#### `update <suiteId>`

Update a regression suite

Flags

- `--default-gate-severity` — Default gate severity: info, warning, or blocking
- `--description` — Suite description
- `--from-file` — JSON file with regression suite patch payload
- `--name` — Suite name
- `--status` — Suite status: active or archived

### `release-gate`

Inspect evaluated release gates

#### `list`

List evaluated release gates

Flags

- `--baseline` — Baseline run ID
- `--candidate` — Candidate run ID

### `replay`

View execution replays

#### `get <runAgentId>`

Get execution replay steps

Flags

- `--cursor` — Step offset to start from
- `--limit` — Steps per page (1-200)

### `run`

Manage evaluation runs

#### `agents <runId>`

List agents in a run

#### `create`

Create and submit an evaluation run

Flags

- `--case` — Regression case IDs (repeatable)
- `--challenge-pack-version` — Challenge pack version ID (optional in a TTY; prompted when omitted)
- `--deployments` — Agent deployment IDs (optional in a TTY; prompted when omitted)
- `--follow` — Follow run events after creation
- `--include-proposed-regressions` — Include proposed regression cases for validation runs
- `--input-set` — Challenge input set ID (optional)
- `--name` — Run name (optional)
- `--race-context` — Enable live peer-standings injection during the run (requires 2+ agents)
- `--race-context-cadence` — Override race-context cadence; minimum steps between standings injections, [1, 10]. 0 uses the backend default.
- `--scope` default: `full` — Run scope: full or suite_only
- `--suite` — Regression suite IDs (repeatable; required with --scope suite_only unless --case is used)

#### `events <runId>`

Stream live run events via SSE

#### `failures <runId>`

List failure review items for a run

Flags

- `--agent` — Filter by run agent ID
- `--class` — Filter by failure class
- `--cluster` — Filter by failure cluster key
- `--cursor` — Pagination cursor
- `--evidence-tier` — Filter by evidence tier
- `--limit` — Maximum failures to return
- `--severity` — Filter by severity: info, warning, or blocking

#### `get <id>`

Get run details

#### `list`

List runs in the workspace

#### `promote-failure <runId> <challengeIdentityId>`

Promote a run failure into a regression case

Flags

- `--failure-summary` — Failure summary
- `--from-file` — JSON file with promotion payload
- `--promotion-mode` — Promotion mode: full_executable or output_only
- `--run-agent` — Run agent ID
- `--severity` — Case severity: info, warning, or blocking
- `--suite` — Regression suite ID
- `--title` — Regression case title

#### `ranking <runId>`

Get run ranking and composite scores

Flags

- `--sort-by` — Sort by: composite, correctness, reliability, latency, cost

#### `scorecard <runAgentId>`

Get agent scorecard

### `secret`

Manage workspace secrets

#### `delete <key>`

Delete a secret

#### `list`

List workspace secret keys

#### `set <key>`

Create or update a secret

Flags

- `--value` — Secret value (reads from stdin if omitted)

### `version`

Show CLI version information

### `workspace`

Manage workspaces

#### `create`

Create a workspace

Flags

- `--name` (required) — Workspace name (required)
- `--org` — Organization ID (required)
- `--slug` — Workspace slug (optional)

#### `get <id>`

Get workspace details

#### `list`

List workspaces in an organization

Flags

- `--org` — Organization ID (uses default if not set)

#### `members`

Manage workspace members

##### `invite`

Invite a member to the workspace

Flags

- `--email` (required) — Email address to invite (required)
- `--role` default: `workspace_member` — Role: workspace_admin, workspace_member, workspace_viewer

##### `list`

List workspace members

##### `update <membershipId>`

Update a workspace membership

Flags

- `--role` — New role
- `--status` — New status

#### `update <id>`

Update a workspace

Flags

- `--name` — New workspace name
- `--status` — New status (active, archived)

#### `use <id>`

Set the default workspace

## Source pointers

- `cli/cmd/root.go`
- `cli/cmd/auth.go`
- `cli/cmd/workspace.go`
- `cli/cmd/run.go`
- `cli/cmd/compare.go`

---

# Config Reference

Environment variables and config precedence generated from the current source readers.

Source: https://www.agentclash.dev/docs/reference/config
Markdown export: https://www.agentclash.dev/docs-md/reference/config

Pack-authored sandbox settings live in YAML (see [Sandbox & E2B](../challenge-packs/sandbox-and-e2b)); the tables appended below cover **worker and API runtime** knobs discovered from checked-in Go config readers and `.env.example`.

This page is generated from the config readers in the API server, worker, CLI, and the checked-in backend example environment file.

## CLI precedence

- API URL: `--api-url > AGENTCLASH_API_URL > saved user config > http://localhost:8080`
- Workspace: `--workspace > AGENTCLASH_WORKSPACE > project config > user config`
- Output format: `--json > --output > user config > table`

## API Server Environment

| Variable | Default | Description |
| --- | --- | --- |
| `AGENTCLASH_SECRETS_MASTER_KEY` | — | Read by backend/internal/api/config.go. |
| `API_SERVER_BIND_ADDRESS` | `":8080"` | Bind address for the API server process. |
| `APP_ENV` | `"development"` | Select deployment environment behavior. |
| `ARTIFACT_MAX_UPLOAD_BYTES` | `100 << 20` | Upper bound for artifact upload size accepted by the API server. |
| `ARTIFACT_SIGNED_URL_TTL_SECONDS` | `5 * time.Minute` | Expiry window for signed artifact URLs returned by the API server. |
| `ARTIFACT_SIGNING_SECRET` | — | Signing secret for artifact URL generation; required outside local filesystem dev mode. |
| `ARTIFACT_STORAGE_BACKEND` | `"filesystem"` | Choose filesystem or S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_BUCKET` | `"agentclash-dev-artifacts"` | Artifact bucket or logical container name. |
| `ARTIFACT_STORAGE_FILESYSTEM_ROOT` | `filepath.Join(os.TempDir(` | Local artifact root when the filesystem backend is in use. |
| `ARTIFACT_STORAGE_S3_ACCESS_KEY_ID` | — | Access key for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_ENDPOINT` | — | Optional custom endpoint for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_FORCE_PATH_STYLE` | `true` | Toggle path-style addressing for S3-compatible storage. |
| `ARTIFACT_STORAGE_S3_REGION` | — | Region for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_SECRET_ACCESS_KEY` | — | Secret key for S3-compatible artifact storage. |
| `AUTH_MODE` | `"dev"` | Select dev headers or WorkOS-backed authentication for the API. |
| `CORS_ALLOWED_ORIGINS` | — | Allowed browser origins for the API in WorkOS mode. |
| `DATABASE_URL` | `"postgres://agentclash:agentclash@localhost:5432/agentclash?sslmode=disable"` | Postgres connection string. |
| `DODO_API_BASE_URL` | — | Read by backend/internal/api/config.go. |
| `DODO_ENVIRONMENT` | `"test"` | Read by backend/internal/api/config.go. |
| `DODO_PAYMENTS_API_KEY` | — | Read by backend/internal/api/config.go. |
| `DODO_PAYMENTS_WEBHOOK_KEY` | — | Read by backend/internal/api/config.go. |
| `DODO_PRODUCT_PRO_MONTHLY` | — | Read by backend/internal/api/config.go. |
| `DODO_PRODUCT_PRO_YEARLY` | — | Read by backend/internal/api/config.go. |
| `DODO_PRODUCT_TEAM_MONTHLY` | — | Read by backend/internal/api/config.go. |
| `DODO_PRODUCT_TEAM_YEARLY` | — | Read by backend/internal/api/config.go. |
| `FRONTEND_URL` | — | Public web origin used in emails and CLI auth links. |
| `GITHUB_APP_PRIVATE_KEY` | — | Read by backend/internal/api/config.go. |
| `GITHUB_APP_SLUG` | — | Read by backend/internal/api/config.go. |
| `GITHUB_APP_STATE_SECRET` | — | Read by backend/internal/api/config.go. |
| `GITHUB_WEBHOOK_SECRET` | — | Read by backend/internal/api/config.go. |
| `HOSTED_RUN_CALLBACK_SECRET` | `"agentclash-dev-hosted-callback-secret"` | Shared secret for hosted-run callback authentication. |
| `RESEND_API_KEY` | — | Enable invite email sending through Resend. |
| `RESEND_FROM_EMAIL` | — | Sender address for invite emails. |
| `TEMPORAL_HOST_PORT` | `"localhost:7233"` | Temporal frontend address. |
| `TEMPORAL_NAMESPACE` | `"default"` | Temporal namespace used by the API and worker. |
| `WORKOS_CLIENT_ID` | — | WorkOS client ID used when the API is in workos auth mode. |
| `WORKOS_ISSUER` | — | Optional WorkOS issuer override for JWT validation. |

## Worker Environment

| Variable | Default | Description |
| --- | --- | --- |
| `AGENTCLASH_SECRETS_MASTER_KEY` | — | Read by backend/internal/worker/config.go. |
| `APP_ENV` | `"development"` | Select deployment environment behavior. |
| `ARTIFACT_SANDBOX_ASSET_MAX_BYTES` | `100 << 20` | Read by backend/internal/worker/config.go. |
| `ARTIFACT_STORAGE_BACKEND` | `"filesystem"` | Choose filesystem or S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_BUCKET` | `"agentclash-dev-artifacts"` | Artifact bucket or logical container name. |
| `ARTIFACT_STORAGE_FILESYSTEM_ROOT` | `filepath.Join(os.TempDir(` | Local artifact root when the filesystem backend is in use. |
| `ARTIFACT_STORAGE_S3_ACCESS_KEY_ID` | — | Access key for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_ENDPOINT` | — | Optional custom endpoint for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_FORCE_PATH_STYLE` | `true` | Toggle path-style addressing for S3-compatible storage. |
| `ARTIFACT_STORAGE_S3_REGION` | — | Region for S3-compatible artifact storage. |
| `ARTIFACT_STORAGE_S3_SECRET_ACCESS_KEY` | — | Secret key for S3-compatible artifact storage. |
| `DATABASE_URL` | `"postgres://agentclash:agentclash@localhost:5432/agentclash?sslmode=disable"` | Postgres connection string. |
| `E2B_API_BASE_URL` | — | Optional E2B API base URL override. |
| `E2B_API_KEY` | — | API key for the E2B sandbox provider. |
| `E2B_REQUEST_TIMEOUT` | `30*time.Second` | Timeout for E2B sandbox API calls. |
| `E2B_TEMPLATE_ID` | — | Template ID for the E2B sandbox provider. |
| `GITHUB_APP_PRIVATE_KEY` | — | Read by backend/internal/worker/config.go. |
| `HOSTED_RUN_CALLBACK_BASE_URL` | `"http://localhost:8080"` | Base URL the worker uses when calling hosted-run callback endpoints. |
| `HOSTED_RUN_CALLBACK_SECRET` | `"agentclash-dev-hosted-callback-secret"` | Shared secret for hosted-run callback authentication. |
| `SANDBOX_PROVIDER` | `"unconfigured"` | Choose unconfigured or e2b for native sandbox execution. |
| `TEMPORAL_HOST_PORT` | `"localhost:7233"` | Temporal frontend address. |
| `TEMPORAL_NAMESPACE` | `"default"` | Temporal namespace used by the API and worker. |
| `WORKER_IDENTITY` | `defaultWorkerIdentity(` | Logical worker identity label. |
| `WORKER_SHUTDOWN_TIMEOUT` | `10 * time.Second` | Graceful shutdown timeout for the worker process. |

## CLI Environment

| Variable | Default | Description |
| --- | --- | --- |
| `AGENTCLASH_API_URL` | — | Override the CLI API base URL. |
| `AGENTCLASH_DEV_ORG_MEMBERSHIPS` | — | Inject development org memberships into the CLI dev-auth path. |
| `AGENTCLASH_DEV_USER_ID` | — | Inject a development user ID for CLI dev mode. |
| `AGENTCLASH_DEV_WORKSPACE_MEMBERSHIPS` | — | Inject development workspace memberships into the CLI dev-auth path. |
| `AGENTCLASH_ORG` | — | Override the default organization ID for CLI commands. |
| `AGENTCLASH_TOKEN` | — | Provide a CLI token directly, mainly for CI or automation. |
| `AGENTCLASH_WORKSPACE` | — | Override the default workspace ID for CLI commands. |

## Backend Example Environment

| Variable | Default | Description |
| --- | --- | --- |
| `AGENTCLASH_SECRETS_MASTER_KEY` | — | Present in the backend example environment file. |
| `API_SERVER_BIND_ADDRESS` | `:8080` | Bind address for the API server process. |
| `APP_ENV` | `development` | Select deployment environment behavior. |
| `AUTH_MODE` | `dev` | Select dev headers or WorkOS-backed authentication for the API. |
| `DATABASE_URL` | `postgres://agentclash:agentclash@localhost:5432/agentclash?sslmode=disable` | Postgres connection string. |
| `DODO_ENVIRONMENT` | `test` | Present in the backend example environment file. |
| `DODO_PAYMENTS_API_KEY` | — | Present in the backend example environment file. |
| `DODO_PAYMENTS_WEBHOOK_KEY` | — | Present in the backend example environment file. |
| `DODO_PRODUCT_PRO_MONTHLY` | — | Present in the backend example environment file. |
| `DODO_PRODUCT_PRO_YEARLY` | — | Present in the backend example environment file. |
| `DODO_PRODUCT_TEAM_MONTHLY` | — | Present in the backend example environment file. |
| `DODO_PRODUCT_TEAM_YEARLY` | — | Present in the backend example environment file. |
| `E2B_API_BASE_URL` | — | Optional E2B API base URL override. |
| `E2B_API_KEY` | — | API key for the E2B sandbox provider. |
| `E2B_REQUEST_TIMEOUT` | `30s` | Timeout for E2B sandbox API calls. |
| `E2B_TEMPLATE_ID` | — | Template ID for the E2B sandbox provider. |
| `FRONTEND_URL` | `http://localhost:3000` | Public web origin used in emails and CLI auth links. |
| `HOSTED_RUN_CALLBACK_BASE_URL` | `http://localhost:8080` | Base URL the worker uses when calling hosted-run callback endpoints. |
| `HOSTED_RUN_CALLBACK_SECRET` | `agentclash-dev-hosted-callback-secret` | Shared secret for hosted-run callback authentication. |
| `REDIS_URL` | `redis://localhost:6379` | Enable Redis-backed event fanout and related features. |
| `RESEND_API_KEY` | — | Enable invite email sending through Resend. |
| `RESEND_FROM_EMAIL` | `noreply@agentclash.dev` | Sender address for invite emails. |
| `SANDBOX_PROVIDER` | `unconfigured` | Choose unconfigured or e2b for native sandbox execution. |
| `TEMPORAL_HOST_PORT` | `localhost:7233` | Temporal frontend address. |
| `TEMPORAL_NAMESPACE` | `default` | Temporal namespace used by the API and worker. |
| `WORKER_IDENTITY` | `agentclash-worker-local` | Logical worker identity label. |
| `WORKER_SHUTDOWN_TIMEOUT` | `10s` | Graceful shutdown timeout for the worker process. |
| `WORKOS_CLIENT_ID` | — | WorkOS client ID used when the API is in workos auth mode. |
| `WORKOS_ISSUER` | — | Optional WorkOS issuer override for JWT validation. |

## Source pointers

- `backend/internal/api/config.go`
- `backend/internal/worker/config.go`
- `cli/internal/config/manager.go`
- `backend/.env.example`

---

# Architecture Overview

AgentClash is a monorepo with a small number of load-bearing runtime components. This page names them and explains why they are split this way.

Source: https://www.agentclash.dev/docs/architecture/overview
Markdown export: https://www.agentclash.dev/docs-md/architecture/overview

AgentClash has four main runtime surfaces:

- a Next.js web app for the product UI
- a Go API server for REST and WebSocket traffic
- a Go worker that executes run workflows
- a Go CLI that talks to the API directly

## System sketch

```text
browser / CLI
      |
      v
  API server  ----> Postgres
      |
      +----> Redis (optional event fanout)
      |
      v
   Temporal  <----> worker
                    |
                    +----> provider router
                    +----> sandbox provider (optional E2B)
                    +----> artifact storage
```

## Why it is shaped this way

Temporal is the backbone because long-running agent work is exactly the kind of thing that turns into retry, timeout, cancellation, and partial-progress pain if you try to improvise a one-off orchestrator. The API server stays relatively thin: validate the request, load context, enqueue or signal workflow work, and expose the resulting state. The worker owns the expensive and failure-prone part of the system: provider calls, sandboxed execution, event recording, and workflow activities.

That split also keeps the user-facing web app simpler. The web app does not need to own the run engine. It just renders state, telemetry, and management flows on top of the API.

## Runtime components

### Web

The web app lives in `web/` and is a Next.js app using App Router.

### API server

The API server entry point is `backend/cmd/api-server/main.go`. It loads config, opens Postgres and Temporal clients, initializes storage, auth, and managers, then starts the HTTP server.

### Worker

The worker entry point is `backend/cmd/worker/main.go`. It loads config, connects to Postgres and Temporal, wires the provider router and sandbox provider, and runs the Temporal worker loop.

### CLI

The CLI lives in `cli/`. The root Cobra command is defined in `cli/cmd/root.go`, and the user-facing workflows are grouped under `auth`, `workspace`, `run`, and `compare`.

## Optional infrastructure

- Redis enables event publishing and fanout.
- E2B enables sandboxed native execution.
- S3-compatible storage replaces local filesystem artifact storage in production.

## Code pointers

- `backend/cmd/api-server/main.go`
- `backend/cmd/worker/main.go`
- `cli/cmd/root.go`
- `web/src/app`

## See also

- [Self-Host Starter](https://www.agentclash.dev/docs-md/getting-started/self-host)
- [Contributor Setup](https://www.agentclash.dev/docs-md/contributing/setup)

---

# Orchestration

The API server accepts the request, but Temporal and the worker are what make long-running run execution survivable.

Source: https://www.agentclash.dev/docs/architecture/orchestration
Markdown export: https://www.agentclash.dev/docs-md/architecture/orchestration

AgentClash models long-running execution as workflow work, not as a single API request that tries to stay alive forever.

## The runtime split

<DiagramOrchestrationRuntimeSplit />

## What the API server does

The API server handles authentication, authorization, request validation, and persistence setup. For run creation, it wires the request into a Temporal workflow starter rather than trying to execute the whole run inline inside the HTTP handler.

That keeps the API server responsive and gives the platform a durable handoff point for work that may outlive the incoming request.

## What the worker does

The worker connects to Temporal, Postgres, the provider router, and the sandbox provider. It owns the expensive part of the system:

- provider calls
- sandbox-backed execution
- event emission
- result persistence

The worker also decides whether sandbox execution is really available. In the current code, `SANDBOX_PROVIDER=unconfigured` is a valid boot mode, which is why local run creation can succeed even when full native execution is not available.

## Why Temporal is load-bearing here

Run execution is exactly the category of problem where retries, timeouts, cancellation, and partial progress stop being “nice to have” as soon as you leave toy demos. Temporal gives AgentClash a durable workflow backbone so the API server can enqueue work, the worker can resume it, and failures can be handled with explicit workflow semantics instead of improvised queue glue.

## Code pointers

- `backend/cmd/api-server/main.go`
- `backend/cmd/worker/main.go`
- `backend/internal/worker`
- `backend/internal/workflow`

## See also

- [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview)
- [Frontend Architecture](https://www.agentclash.dev/docs-md/architecture/frontend)

---

# Sandbox Layer

Understand why AgentClash isolates execution behind a sandbox provider boundary and how E2B fits today.

Source: https://www.agentclash.dev/docs/architecture/sandbox-layer
Markdown export: https://www.agentclash.dev/docs-md/architecture/sandbox-layer

The sandbox layer is the execution boundary between AgentClash orchestration and the environment where an agent actually runs.

## Why the sandbox boundary exists

The workflow engine should decide what to run and when to retry. It should not directly own process isolation, filesystem risk, network policy, or provider-specific runtime setup. Those concerns change at a different rate and carry a different failure model.

That is why the architecture keeps a boundary between orchestration and execution:

- the API decides that a run should exist
- Temporal workflows coordinate the lifecycle
- the worker performs execution work
- the sandbox provider supplies isolation for the runnable target

## Why E2B is the current fit

The current local-development and worker docs show E2B as the concrete provider in use today. That gives AgentClash a managed isolation layer without having to invent a bespoke container orchestration story inside the app itself.

The main benefits are straightforward:

- isolation is handled outside the web and API processes
- runtime setup is explicit and configurable through worker environment
- the provider can be swapped later without rewriting the product model around runs and evidence

<DiagramSandboxBoundary />

## What this boundary protects

This is not only about security. It is also about keeping failure domains honest.

When a run fails, you want to know whether the issue belongs to:

- the scheduler
- the worker logic
- the sandbox provider
- the agent itself

A clean sandbox boundary makes that diagnosis easier because provider setup and execution failures do not get mixed into the same code path as API concerns.

## What to read in the code

Start with these files and directories:

- `backend/internal/worker/config.go` for sandbox-related environment surface
- `backend/internal/worker` for worker-side execution behavior
- `docs/worker/local-development.md` for how the local stack expects the provider to be configured

## Why not bake execution directly into the web or API app

Because that would collapse the concerns that need to stay separate. You would tie request handling, orchestration, and risky execution into the same operational surface. That is faster for a demo and worse for a real evaluation platform.

## See also

- [Orchestration](../architecture/orchestration)
- [Evidence Loop](../architecture/evidence-loop)
- [Self-Host](../getting-started/self-host)
- [Config Reference](../reference/config)

---

# Data Model

Learn the core entities behind workspaces, deployments, challenge packs, runs, and evidence in AgentClash.

Source: https://www.agentclash.dev/docs/architecture/data-model
Markdown export: https://www.agentclash.dev/docs-md/architecture/data-model

The AgentClash data model exists to answer one question cleanly: what exactly was run, against what workload, and what evidence did it produce?

## The core entities

At a high level, the schema revolves around a small set of concepts:

- workspaces: the ownership boundary for deployments and evaluation assets
- deployments: the runnable agent targets attached to a workspace
- challenge packs: repeatable workloads that define what should be attempted
- runs: concrete execution attempts
- replay events and artifacts: the evidence emitted while a run executes
- scorecards and comparisons: the summarized judgments built from that evidence

<DiagramWorkspaceDataModel />

## Why the model is shaped this way

The schema is not just there to persist app state. It is there to preserve comparability.

If the system cannot answer these relationships clearly, the product falls apart:

- which workspace owned the evaluated deployment
- which workload definition the run used
- which evidence belongs to that run
- which summary or comparison was derived from that evidence

That is why the data model follows the execution model closely instead of hiding it behind generic analytics tables.

## What the schema needs to make cheap

Three classes of queries matter in practice:

- operator queries: what is currently running, stuck, or failing
- reviewer queries: why did this run score the way it did
- comparison queries: did the new deployment improve or regress on the same workload

The current schema diagrams and domain notes in the repo are already organized around that shape. They optimize for traceability first, not just storage convenience.

## Where to start in the repo

The most useful entry points are:

- `docs/database/schema-diagram.md` for the current entity map
- `docs/domains/domains.md` for domain-level language and ownership boundaries
- `backend/internal/api` for the read and write paths that expose those entities

## See also

- [Runs and Evals](../concepts/runs-and-evals)
- [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs)
- [Replay and Scorecards](../concepts/replay-and-scorecards)
- [Overview](../architecture/overview)

---

# Evidence Loop

See how AgentClash moves from execution events to replay, scorecards, and future evaluation improvement.

Source: https://www.agentclash.dev/docs/architecture/evidence-loop
Markdown export: https://www.agentclash.dev/docs-md/architecture/evidence-loop

The evidence loop is the path that turns a finished run into something you can replay, score, compare, and learn from later.

## Why this loop matters

The hardest part of agent evaluation is not running one experiment. It is preserving enough evidence to make the next experiment smarter.

AgentClash leans on a canonical event model so the same execution can feed multiple consumers:

- a live or replayable timeline
- artifacts and logs for detailed inspection
- scorecards for compact judgments
- future workload design when a failure is worth preserving

<DiagramEvidenceClosingLoop />

## The canonical event envelope is the hinge

The replay docs in the repo make this explicit: if different subsystems emit ad-hoc logs, you cannot build a trustworthy replay or comparison layer on top. The canonical event envelope is the normalization step that keeps evidence portable.

That gives AgentClash a cleaner stack:

- execution code emits structured events
- the UI can render those events as a timeline
- scoring can summarize those events without losing the underlying trail
- later analysis can reuse the same evidence instead of scraping logs again

## Why scorecards should not replace evidence

A scorecard is the summary, not the truth source. Treat it as the decision layer that sits on top of replay and artifacts. When there is disagreement about a result, the replay and artifact trail should still be there to settle it.

That is the practical reason this loop exists. It keeps ranking and triage honest.

## Where to start in the repo

- `docs/replay/canonical-event-envelope.md` for the event model
- `docs/evaluation/challenge-pack-v0.md` for how useful failures become future workloads
- `web/src/app` for the frontend surfaces that consume replay and results

## See also

- [Replay and Scorecards](../concepts/replay-and-scorecards)
- [Interpret Results](../guides/interpret-results)
- [Data Model](../architecture/data-model)
- [First Eval](../getting-started/first-eval)

---

# Frontend Architecture

The Next.js app is doing three jobs at once: public marketing pages, authenticated product surfaces, and now a public docs experience.

Source: https://www.agentclash.dev/docs/architecture/frontend
Markdown export: https://www.agentclash.dev/docs-md/architecture/frontend

The web app is not just the dashboard. It currently carries the product landing pages, AuthKit-based auth flows, authenticated workspace routes, the blog, and the docs surface.

## Route split

<DiagramFrontendRouteSplit />

## Public pages

The landing page, team page, blog, and docs live in the same Next.js app so they can share typography, branding, and deployment infrastructure. The docs implementation deliberately reuses the same MDX stack already used by the blog instead of adding a second docs framework right away.

## Authenticated app routes

The authenticated product surfaces live under workspace and organization routes and are guarded by WorkOS AuthKit. That split matters for docs because the docs surface needs to stay public even when local WorkOS env is not configured.

## Docs-specific decision

The docs route is now treated as public and bypasses the AuthKit middleware path. That keeps `/docs` reachable in local development and in environments where the docs should be browsable without login.

## Why keep docs in the existing app

For this stage of the product, keeping docs inside the current Next.js app is the pragmatic move:

- no second frontend deployment stack
- reuse of existing fonts, theme tokens, and MDX tooling
- one place to link from product pages into docs

If the docs surface later needs versioning, heavy search, or separate publishing workflows, moving to a more specialized framework still stays open.

## Code pointers

- `web/src/app`
- `web/src/middleware.ts`
- `web/src/lib/blog.ts`
- `web/src/lib/docs.ts`

## See also

- [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview)
- [Hosted Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart)

---

# Contributor Setup

Clone the repo, bring up the local stack, and pick the fastest development loop for the part of AgentClash you are changing.

Source: https://www.agentclash.dev/docs/contributing/setup
Markdown export: https://www.agentclash.dev/docs-md/contributing/setup

If you are touching backend workflows or the web product, start the full local stack. If you are only changing the CLI, point the CLI at production and skip the backend entirely.

## Full-stack local development

### 1. Clone the repo and start the stack

```bash
git clone https://github.com/agentclash/agentclash.git
cd agentclash
./scripts/dev/start-local-stack.sh
```

That script starts PostgreSQL, Redis, Temporal, the API server, and the worker.

### 2. Start the web app

```bash
cd web
pnpm install
pnpm dev
```

### 3. Seed local data

```bash
cd ..
./scripts/dev/seed-local-run-fixture.sh
./scripts/dev/curl-create-run.sh
```

## Faster loop for CLI-only work

If you only need the CLI:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
cd cli
go run . auth login --device
go run . link
go run . run list
```

This is the fastest way to change the CLI without also running the API server and worker locally.

## What lives where

- `backend/cmd/api-server` — API entry point
- `backend/cmd/worker` — worker entry point
- `cli/` — Cobra CLI
- `web/` — Next.js app
- `docs/` — existing internal markdown docs and references
- `testing/` — review contracts, test notes, and issue-specific validation docs

## See also

- [Self-Host Starter](https://www.agentclash.dev/docs-md/getting-started/self-host)
- [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview)

---

# Codebase Tour

Map the main AgentClash modules before you start changing workflows, APIs, or the web UI.

Source: https://www.agentclash.dev/docs/contributing/codebase-tour
Markdown export: https://www.agentclash.dev/docs-md/contributing/codebase-tour

This repo is easier to navigate if you follow the runtime path instead of reading directories alphabetically.

## Start with the product surfaces

The main user-facing modules are:

- `web/`: the Next.js product site, authenticated app, and docs surface
- `cli/`: the standalone Go CLI used against local or hosted backends
- `backend/`: API and worker-side Go services
- `docs/`: deeper architecture, build-order, replay, and local-development notes

If you only remember one thing, remember this: the web app is not the whole product. AgentClash is a multi-service system with a CLI and a workflow engine behind it.

## Follow a run from request to evidence

A useful mental path through the repo is:

<DiagramCodebaseTourShortcuts />

That is the shortest route from "user action" to "why did this run behave that way".

## Where to look for common tasks

### I need to change the web UX

Start in `web/src/app` for routes and page entry points, then `web/src/components` for the shared UI.

### I need to change auth or API behavior

Start in `backend/internal/api` and trace the handler path from request shape to domain logic.

### I need to change execution behavior

Start in `backend/internal/worker` and the orchestration docs so you understand what the workflow owns versus what the activity owns.

### I need to change local or hosted CLI behavior

Start in `cli/cmd` for command surface and `cli/internal` for config and supporting behavior.

### I need product context before I code

Start in `docs/` instead of guessing. The build-order, domain, replay, and database notes are there for a reason.

## A practical reading order for new contributors

1. Read [Setup](../contributing/setup).
2. Read [Overview](../architecture/overview).
3. Read [Orchestration](../architecture/orchestration).
4. Skim `docs/domains/domains.md` and `docs/database/schema-diagram.md`.
5. Only then start changing handlers, workflows, or UI.

That order is faster than diving straight into implementation files without the system model in your head.

## See also

- [Setup](../contributing/setup)
- [Testing](../contributing/testing)
- [Overview](../architecture/overview)
- [Frontend](../architecture/frontend)

---

# Testing

Choose the smallest useful validation loop and use review checkpoints to keep implementation work scoped.

Source: https://www.agentclash.dev/docs/contributing/testing
Markdown export: https://www.agentclash.dev/docs-md/contributing/testing

Testing in AgentClash should match the surface you changed. Do not default to the biggest possible loop.

## Pick the smallest loop that can prove the change

A useful rule is:

- docs or web route changes: start with `pnpm build` in `web/`
- CLI changes: use the `cli/` module commands and tests
- packaging changes: rehearse the release flow instead of guessing
- workflow or backend changes: validate the specific service path you touched

For CLI work, the repo already gives you a concrete baseline:

```bash
cd cli
go build ./...
go vet ./...
go test -short -race -count=1 ./...
go run github.com/goreleaser/goreleaser/v2@latest check
go run github.com/goreleaser/goreleaser/v2@latest release --snapshot --clean
cd ../web && pnpm build
cd ..
bash testing/cli-e2e-suite.sh --help
```

If the change affects human-facing CLI output, also add assertions for:

- `agentclash --help` and command-level `--help`
- table output for the happy path
- JSON envelopes for workflow commands such as `eval scorecard --json`
- config persistence when the command writes local state

## Use review checkpoints for implementation work

When the change is more than a one-line fix, lock the contract before you start coding.

The review-checkpoint workflow is simple:

1. create `testing/<branch-or-task>.md`
2. write the scope, functional expectations, tests, manual verification, and non-goals
3. treat that contract as fixed until requirements explicitly change
4. after each implementation step, update `/tmp/reviewcheckpoint.json`
5. run the scoped verification listed in the contract before you declare the work ready

This does two things well:

- it prevents scope drift
- it makes agent-assisted changes auditable instead of magical

> Note: Keep `/tmp/reviewcheckpoint.json` local scratch only. It is a working review log,
> not repo content.

## What to record in each checkpoint

A good checkpoint update includes:

- the current step number
- which files changed
- which contract items were addressed
- self-review result for that step
- cumulative review result across all steps so far
- unresolved risks

That cumulative review matters. It is how you catch drift after several small edits have accumulated.

## Manual verification still matters

Not every useful check is a unit test. For docs, routing, and UI work, manual verification is often part of the contract. Be explicit about it.

Examples:

- opening a docs route and confirming it renders
- confirming a public route bypasses auth middleware correctly
- verifying a generated reference page actually contains generated content

## See also

- [Codebase Tour](../contributing/codebase-tour)
- [Setup](../contributing/setup)
- [CLI Reference](../reference/cli)
- [Config Reference](../reference/config)

---

# Agent Build Author Skill

Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.

Source: https://www.agentclash.dev/docs/agent-skills/agent-build-skills/agentclash-agent-build-author
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author

Canonical source: `web/content/agent-skills/agent-build-skills/agentclash-agent-build-author/SKILL.md`

Markdown export: `/docs-md/agent-skills/agent-build-skills/agentclash-agent-build-author`

## Use This Skill When

Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.

## Full SKILL.md

````markdown
---
name: agentclash-agent-build-author
description: Use when creating, editing, validating, or readying AgentClash agent builds and build versions, including agent identity, spec JSON, prompts, model/runtime expectations, tool bindings, and version readiness.
metadata:
  agentclash.role: agent-builds
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Agent Build Author

## Purpose
Create or update an AgentClash agent build and draft build version so it has a source-backed spec, a validation-clean prompt policy, and a ready build version ID that deployment setup can consume.

## Use When
- The user needs a new AgentClash agent build or a new version of an existing build.
- A prompt, model expectation, output schema, tool binding, memory, workflow, reasoning, guardrail, or publication note should be captured in build-version JSON.
- A deployment is blocked because the user has no ready `build_version_id`.
- A coding agent needs to turn product behavior into an AgentClash build spec before deployment setup.

## Do Not Use When
- The CLI is not authenticated or no workspace is selected; use `agentclash-cli-setup` first.
- Provider accounts, model aliases, runtime profiles, workspace tools, or secrets are missing; use `agentclash-runtime-resources-setup` first.
- The build version is already ready and the user only needs a deployment; use `agentclash-agent-deployment-setup`.
- The user is authoring challenge pack YAML; use the challenge-pack skills.

## Inputs Needed
- Workspace ID and confirmation that `agentclash doctor` can reach it.
- Build name and optional description.
- Whether to create a new build or add a version to an existing build ID.
- Agent kind: `llm_agent`, `workflow_agent`, `programmatic_agent`, `multi_agent_system`, or `hosted_external`.
- Prompt or policy instructions. Validation requires `policy_spec.instructions`.
- Interface, reasoning, memory, workflow, guardrail, model, output schema, trace, and publication requirements.
- Optional workspace tool IDs and knowledge source IDs to bind to the version.
- Runtime assumptions handed off from `agentclash-runtime-resources-setup`, such as model alias, provider account, shell/network needs, and timeouts.

## Environment
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash doctor
```

Build commands need a resolved workspace for list/create. Workspace resolution follows the normal CLI setup path, so prefer `agentclash workspace use <workspace-id>` or `AGENTCLASH_WORKSPACE` over hard-coding IDs into reusable notes.

## Procedure
1. Verify CLI and workspace readiness with `agentclash doctor`.
2. Run `agentclash build list` and decide whether to create a new build or version an existing build.
3. Create the build when needed with `agentclash build create --name ... --description ...`.
4. Draft `version-spec.json` from the exact build-version fields below. Keep unknown or not-yet-used spec sections as JSON objects, not prose.
5. Create a draft build version with `agentclash build version create <BUILD_ID> --spec-file version-spec.json`.
6. Inspect the draft with `agentclash build version get <VERSION_ID> --json`.
7. Validate with `agentclash build version validate <VERSION_ID> --json` so a coding agent can inspect `valid` and `errors`. Fix any errors by editing the spec and running `agentclash build version update <VERSION_ID> --spec-file version-spec.json`.
8. Mark ready only after validation passes and the user confirms the draft is deployable: `agentclash build version ready <VERSION_ID>`.
9. Report the `agent_build_id`, `build_version_id`, `version_status`, agent kind, and deployment prerequisites.

## Spec Fields
Agent build creation accepts:

```json
{
  "name": "Support Triage Agent",
  "description": "Answers support benchmark cases with cited evidence."
}
```

Build version creation and draft update accept:

```json
{
  "agent_kind": "llm_agent",
  "interface_spec": {
    "input": "challenge case prompt plus attached artifacts",
    "output": "JSON answer matching output_schema"
  },
  "policy_spec": {
    "instructions": "Read the full case, inspect provided artifacts, cite evidence, and return only JSON."
  },
  "reasoning_spec": {
    "strategy": "inspect evidence before answering"
  },
  "memory_spec": {},
  "workflow_spec": {
    "steps": ["read input", "inspect artifacts", "answer"]
  },
  "guardrail_spec": {
    "refuse_if": ["missing required evidence"]
  },
  "model_spec": {
    "preferred_model_alias": "primary-chat",
    "temperature": 0.1
  },
  "output_schema": {
    "type": "object",
    "required": ["answer"],
    "properties": {
      "answer": { "type": "string" },
      "evidence": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  },
  "trace_contract": {
    "must_record": ["tool_calls", "final_answer"]
  },
  "publication_spec": {
    "version_notes": "Initial support triage build."
  },
  "tools": [
    {
      "tool_id": "<WORKSPACE_TOOL_UUID>",
      "binding_role": "evidence_lookup",
      "binding_config": {}
    }
  ],
  "knowledge_sources": [
    {
      "knowledge_source_id": "<KNOWLEDGE_SOURCE_UUID>",
      "binding_role": "reference",
      "binding_config": {}
    }
  ]
}
```

Required readiness invariant: `policy_spec` must contain an `instructions` field. The current validation also checks that `agent_kind` is one of the supported enum values. Omitted JSON spec objects default to `{}`; omitted tool and knowledge-source bindings default to empty lists.

## Commands
Verify setup and inspect existing builds:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash doctor
agentclash build list
agentclash build get <BUILD_ID>
```

Create a build:

```bash
agentclash build create \
  --name "Support Triage Agent" \
  --description "Answers support benchmark cases with cited evidence."
```

Create, inspect, validate, update, and ready a build version:

```bash
agentclash build version create <BUILD_ID> --spec-file version-spec.json
agentclash build version get <VERSION_ID> --json
agentclash build version validate <VERSION_ID> --json
agentclash build version update <VERSION_ID> --spec-file version-spec.json
agentclash build version ready <VERSION_ID>
```

When only the agent kind needs to be set or overridden at creation time:

```bash
agentclash build version create <BUILD_ID> --agent-kind llm_agent --spec-file version-spec.json
```

## Expected Output
- `build list` returns workspace build IDs, names, slugs, lifecycle status, and creation timestamps.
- `build get <BUILD_ID>` returns build metadata plus version history.
- `build create` returns a build ID and generated slug.
- `build version create` returns a draft version ID and version number.
- `build version get --json` returns `version_status`, `agent_kind`, all spec objects, tool bindings, knowledge-source bindings, and creation time.
- `build version validate --json` returns `valid: true` when the agent kind is supported and `policy_spec.instructions` exists. Without `--json`, the CLI prints `Version is valid` or `Version has validation errors`.
- `build version ready` changes `version_status` to `ready`; ready versions are the deployable handoff to deployment setup.

## Failure Modes
- `no workspace specified`: run `agentclash link`, pass `--workspace`, set `AGENTCLASH_WORKSPACE`, or add project config with `agentclash init`.
- `name is required`: build creation needs `--name`.
- `request body must be valid JSON`: `--spec-file` must point to valid JSON, not YAML or Markdown.
- Validation fails on `agent_kind`: use one of `llm_agent`, `workflow_agent`, `programmatic_agent`, `multi_agent_system`, or `hosted_external`.
- Validation fails on `policy_spec`: add `policy_spec.instructions` to the version spec.
- Update fails because the version is not draft: ready versions are immutable for normal authoring; create a new build version instead.
- Tool or knowledge-source binding fails validation: IDs must be UUIDs. Workspace tool creation belongs in `agentclash-runtime-resources-setup`.
- Deployment later fails because the version is not ready: run validate, fix errors, then explicitly mark ready.

## Safety Notes
- Do not put provider API keys, tokens, or customer secrets in any build spec field. Use runtime resources and workspace secrets instead.
- Treat `build version ready` as a publish-style action because it makes the version immutable and deployable; get explicit user confirmation first.
- Keep model names and provider-account assumptions in `model_spec` as expectations or notes unless deployment setup has real runtime resource IDs.
- Prefer `build version get --json` before update or ready so the agent does not overwrite a draft blindly.
- Do not invent workspace tool or knowledge-source IDs. List or create the upstream resources first.

## Report Back Format
```text
Workspace: <workspace-id>
Build: <name> (<agent_build_id>)
Version: v<version_number> (<build_version_id>)
Status: <draft | ready>
Agent kind: <agent_kind>
Validation: <valid | errors>
Runtime assumptions: <provider/model/runtime/tool notes>
Next skill: agentclash-runtime-resources-setup | agentclash-agent-deployment-setup
Notes: <version notes, immutable-ready caveats, or blockers>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-runtime-resources-setup`
- `agentclash-agent-deployment-setup`

## Related Docs
- `/docs-md/concepts/agents-and-deployments`
- `/docs-md/guides/configure-runtime-resources`
- `/docs-md/reference/cli`
- `/docs-md/reference/config`
````

---

# Agent Deployment Setup Skill

Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.

Source: https://www.agentclash.dev/docs/agent-skills/agent-build-skills/agentclash-agent-deployment-setup
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup

Canonical source: `web/content/agent-skills/agent-build-skills/agentclash-agent-deployment-setup/SKILL.md`

Markdown export: `/docs-md/agent-skills/agent-build-skills/agentclash-agent-deployment-setup`

## Use This Skill When

Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.

## Full SKILL.md

````markdown
---
name: agentclash-agent-deployment-setup
description: Use when creating, selecting, or diagnosing AgentClash agent deployments for runs, including ready build versions, runtime profiles, provider/model wiring, deployment IDs, workspace context, and run compatibility.
metadata:
  agentclash.role: agent-deployments
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Agent Deployment Setup

## Purpose
Turn a ready AgentClash build version plus runtime/provider/model resources into an active deployment ID that runs and eval sessions can select.

## Use When
- A user has a ready build version and runtime resources, but no deployment ID yet.
- A run or eval flow needs `agent_deployment_ids`.
- A deployment create request is failing because build version readiness, provider account, model alias, runtime profile, or workspace context is wrong.
- A coding agent needs to audit whether an existing deployment is runnable before starting `agentclash run create` or `agentclash eval start`.

## Do Not Use When
- The CLI is not authenticated or no workspace is selected; use `agentclash-cli-setup` first.
- Provider accounts, model aliases, runtime profiles, workspace secrets, or workspace tools are missing; use `agentclash-runtime-resources-setup` first.
- The build version is still a draft or has validation errors; use `agentclash-agent-build-author` first.
- The user is authoring challenge pack YAML or choosing input sets; use challenge-pack skills.

## Inputs Needed
- Workspace ID and confirmation that `agentclash doctor` can reach it.
- Deployment name.
- `agent_build_id` and ready `build_version_id`.
- `runtime_profile_id`.
- `provider_account_id`.
- `model_alias_id`, or a raw provider model string such as `gpt-4.1` when creating from JSON so the backend can auto-create/reuse a model alias.
- Optional `deployment_config` JSON object.
- Challenge pack or run requirements that affect runtime profile, tool, network, shell, or model compatibility.

## Environment
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash doctor
```

Commands that list or create deployments need a resolved workspace. Use `agentclash workspace use <workspace-id>`, `--workspace`, `AGENTCLASH_WORKSPACE`, or project config from `agentclash init`.

## Procedure
1. Verify CLI auth and workspace context with `agentclash doctor`.
2. Inspect the build and build version with `agentclash build get <BUILD_ID>` and `agentclash build version get <BUILD_VERSION_ID> --json`. Stop if `version_status` is not `ready`.
3. Confirm runtime resources exist: `agentclash infra runtime-profile get <RUNTIME_PROFILE_ID>`, `agentclash infra provider-account get <PROVIDER_ACCOUNT_ID>`, and either `agentclash infra model-alias get <MODEL_ALIAS_ID>` or a raw model value for JSON-file creation.
4. List existing deployments with `agentclash deployment list --json` to avoid duplicates.
5. Create the deployment with flags when you already have a model alias ID, or with `--from-file` when you need `deployment_config` or raw `model` auto-alias behavior.
6. Re-list deployments and record the created deployment ID, status, and `current_build_version_id`.
7. Check run compatibility by confirming the deployment is active and can be passed to `agentclash run create --deployments <DEPLOYMENT_ID>`.
8. Report deployment IDs and any blockers for `agentclash-eval-runner`.

## Deployment Contract
Flag-based creation requires these CLI flags when `--from-file` is not used:

```bash
agentclash deployment create \
  --name support-bot-prod \
  --agent-build-id <AGENT_BUILD_ID> \
  --build-version-id <BUILD_VERSION_ID> \
  --runtime-profile-id <RUNTIME_PROFILE_ID> \
  --provider-account-id <PROVIDER_ACCOUNT_ID> \
  --model-alias-id <MODEL_ALIAS_ID>
```

The CLI requires `--name`, `--agent-build-id`, `--build-version-id`, and `--runtime-profile-id` before it sends a flag-based request. The backend then requires `provider_account_id` and either `model_alias_id` or `model`.

Use JSON when you need the complete API shape, including `deployment_config` or raw model auto-aliasing:

```json
{
  "name": "support-bot-prod",
  "agent_build_id": "<AGENT_BUILD_ID>",
  "build_version_id": "<BUILD_VERSION_ID>",
  "runtime_profile_id": "<RUNTIME_PROFILE_ID>",
  "provider_account_id": "<PROVIDER_ACCOUNT_ID>",
  "model_alias_id": "<MODEL_ALIAS_ID>",
  "deployment_config": {}
}
```

Raw model auto-alias path:

```json
{
  "name": "support-bot-prod",
  "agent_build_id": "<AGENT_BUILD_ID>",
  "build_version_id": "<BUILD_VERSION_ID>",
  "runtime_profile_id": "<RUNTIME_PROFILE_ID>",
  "provider_account_id": "<PROVIDER_ACCOUNT_ID>",
  "model": "gpt-4.1",
  "deployment_config": {}
}
```

When `provider_account_id` plus `model` is supplied and `model_alias_id` is omitted, the backend looks up the provider account, upserts a model catalog entry, and reuses/unarchives/creates a model alias.

## Commands
```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash doctor

agentclash build get <BUILD_ID>
agentclash build version get <BUILD_VERSION_ID> --json

agentclash infra runtime-profile get <RUNTIME_PROFILE_ID>
agentclash infra provider-account get <PROVIDER_ACCOUNT_ID>
agentclash infra model-alias get <MODEL_ALIAS_ID>

agentclash deployment list --json
agentclash deployment create --help
agentclash deployment create --from-file deployment.json
agentclash deploy list
agentclash deploy create --from-file deployment.json

agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --deployments <DEPLOYMENT_ID>
```

## Expected Output
- `deployment list --json` returns `items` with deployment `id`, `name`, `status`, `current_build_version_id`, and timestamps.
- `deployment create` returns a created deployment with `id`, `workspace_id`, `agent_build_id`, `current_build_version_id`, `name`, `slug`, `deployment_type`, `status`, `created_at`, and `updated_at`.
- Human output for deployment creation prints `Created deployment <name> (<id>)`.
- The deployment ID can be passed to run creation as `agent_deployment_ids` through `agentclash run create --deployments <DEPLOYMENT_ID>`.

## Run Compatibility Checks
- The deployment should be active in `agentclash deployment list --json`.
- The deployment should point to the intended ready build version via `current_build_version_id`.
- Runtime profile settings should satisfy the challenge pack: shell, network, timeout, max iterations, and max tool calls.
- Provider account and model alias should match the build's `model_spec` expectations.
- Run creation requires at least one deployment. Non-interactive runs use `--deployments`; TTY runs can prompt from active deployments.
- Backend run creation requires deployment IDs to reference active deployments with snapshots in the selected workspace.

## Failure Modes
- `no workspace specified`: run `agentclash link`, pass `--workspace`, set `AGENTCLASH_WORKSPACE`, or add project config with `agentclash init`.
- `missing required flags when --from-file is not used`: pass `--name`, `--agent-build-id`, `--build-version-id`, and `--runtime-profile-id`, or use `--from-file`.
- `only ready versions can be deployed`: return to `agentclash-agent-build-author`, validate the version, and mark it ready.
- `provider_account_id is required`: create/select a provider account with `agentclash-runtime-resources-setup`.
- `either model_alias_id or model ... is required`: pass `--model-alias-id` with flags, or use `--from-file` with either `model_alias_id` or raw `model`.
- `*_id must be a valid UUID`: copy IDs from `build get`, `build version get --json`, `infra ... get`, or `deployment list --json`.
- Run creation rejects the deployment: confirm it is active, belongs to the selected workspace, and has a snapshot.

## Safety Notes
- Do not put raw provider API keys or tokens in `deployment_config`; use workspace secrets and provider accounts.
- Treat deployment creation as production-affecting when `AGENTCLASH_API_URL` points at hosted production.
- Use `list` and `get` commands before creating resources to avoid duplicate deployments.
- Do not invent IDs. If a required ID is missing, run the upstream runtime resources or build authoring skill first.
- Prefer `--json` for machine-readable checks by coding agents.

## Report Back Format
```text
Workspace: <workspace-id>
Deployment: <name> (<deployment_id>)
Status: <active | paused | archived | unknown>
Build: <agent_build_id>
Build version: <build_version_id> (<ready | blocked>)
Runtime profile: <runtime_profile_id>
Provider account: <provider_account_id>
Model: <model_alias_id | raw model auto-alias>
Run compatibility: <ready | blocked>
Next skill: agentclash-eval-runner
Notes: <runtime/model/provider/challenge-pack caveats>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-runtime-resources-setup`
- `agentclash-agent-build-author`
- `agentclash-eval-runner`

## Related Docs
- `/docs-md/concepts/agents-and-deployments`
- `/docs-md/guides/configure-runtime-resources`
- `/docs-md/reference/cli`
- `/docs-md/reference/config`
````

---

# Runtime Resources Setup Skill

Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.

Source: https://www.agentclash.dev/docs/agent-skills/agent-build-skills/agentclash-runtime-resources-setup
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup

Canonical source: `web/content/agent-skills/agent-build-skills/agentclash-runtime-resources-setup/SKILL.md`

Markdown export: `/docs-md/agent-skills/agent-build-skills/agentclash-runtime-resources-setup`

## Use This Skill When

Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.

## Full SKILL.md

````markdown
---
name: agentclash-runtime-resources-setup
description: Use when configuring AgentClash workspace secrets, provider accounts, model catalog entries, model aliases, runtime profiles, workspace tools, and readiness checks required before agent builds, deployments, evals, or runs.
metadata:
  agentclash.role: runtime-resources
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Runtime Resources Setup

## Purpose
Prepare the workspace infrastructure chain that lets AgentClash turn a ready agent build version into a runnable deployment: secrets, provider accounts, model aliases, runtime profiles, optional workspace tools, and readiness checks.

## Use When
- A deployment cannot run because provider accounts, model aliases, workspace secrets, runtime profiles, or workspace tools are missing.
- A user has a ready agent build/version but does not yet have the runtime resources needed to deploy it.
- A challenge pack or deployment references a secret, tool, model alias, provider account, or runtime profile that does not exist in the selected workspace.
- A coding agent needs a checklist before moving from CLI setup to agent build/deployment setup.

## Do Not Use When
- The CLI is not authenticated or no workspace is selected; use `agentclash-cli-setup` first.
- The user is authoring challenge pack YAML fields; use challenge-pack skills after workspace resources are clear.
- The user is creating the agent build itself; use `agentclash-agent-build-author`.
- The user already has resource IDs and only needs to create/select a deployment; use `agentclash-agent-deployment-setup`.

## Inputs Needed
- Workspace ID and confirmation that `agentclash doctor` can reach it.
- Provider key, such as `openai`, and the credential secret key name.
- Desired model catalog entry ID or enough criteria to pick one from the model catalog.
- Runtime profile requirements: execution target, trace mode, iteration/tool limits, timeouts, and sandbox/network policy.
- Model alias key/display name and, when needed, the provider account it should bind to.
- Optional workspace tool names, tool kinds, and JSON specs.
- Whether the user wants to create resources now or only audit readiness.

## Environment
Use hosted production by default:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

Commands that create or list workspace resources also need a resolved workspace:

```bash
agentclash doctor
agentclash secret list
```

If workspace resolution fails, run the CLI setup skill first and do not create resources yet.

## Resource Order
1. Verify CLI auth and workspace context.
2. Store provider credentials as workspace secrets.
3. Inspect the global model catalog and choose a model catalog entry.
4. Create a provider account that references a workspace secret.
5. Create a runtime profile with execution and sandbox limits.
6. Create a model alias that points at a catalog entry and optionally binds to a provider account.
7. Create optional workspace tools if deployments or packs expect reusable workspace tools.
8. List resources and record IDs for agent build/deployment setup.

## Procedure
1. Run `agentclash doctor` and stop on auth or workspace warnings.
2. Run `agentclash secret list` to see which secret keys already exist. If a secret value is not already available in the user's shell, ask the user to set it themselves with `agentclash secret set <KEY>`; do not request or receive the value in chat.
3. Run `agentclash infra model-catalog list` and `get` the candidate model entry before creating aliases.
4. Create or select the provider account. Prefer `credential_reference: "workspace-secret://KEY"` over putting raw keys in JSON files.
5. Create or select a runtime profile. Keep limits explicit: iterations, tool calls, step timeout, run timeout, and sandbox/network policy.
6. Create or select a model alias that points to the model catalog entry and, when needed, the provider account.
7. Create workspace tools only when the deployment or product workflow needs reusable workspace tool resources. Keep these separate from pack-defined composed tools.
8. Re-list all resources and report the IDs downstream skills need.

## Commands
Verify setup and workspace:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash doctor
```

Store provider credentials as workspace secrets:

```bash
printf '%s' "$OPENAI_API_KEY" | agentclash secret set OPENAI_API_KEY
agentclash secret list
```

Inspect the global model catalog:

```bash
agentclash infra model-catalog list
agentclash infra model-catalog get <MODEL_CATALOG_ENTRY_ID>
```

Create a provider account from a JSON file:

```json
{
  "provider_key": "openai",
  "name": "OpenAI Workspace Account",
  "credential_reference": "workspace-secret://OPENAI_API_KEY",
  "limits_config": {
    "rpm": 60
  }
}
```

```bash
agentclash infra provider-account create --from-file provider-account.json
agentclash infra provider-account list
agentclash infra provider-account get <PROVIDER_ACCOUNT_ID>
```

Create a runtime profile:

```json
{
  "name": "default-native",
  "execution_target": "native",
  "trace_mode": "full",
  "max_iterations": 24,
  "max_tool_calls": 32,
  "step_timeout_seconds": 120,
  "run_timeout_seconds": 1800,
  "profile_config": {
    "sandbox": {
      "allow_shell": true,
      "allow_network": false
    }
  }
}
```

```bash
agentclash infra runtime-profile create --from-file runtime-profile.json
agentclash infra runtime-profile list
agentclash infra runtime-profile get <RUNTIME_PROFILE_ID>
```

Create a model alias:

```json
{
  "alias_key": "primary-chat",
  "display_name": "Primary Chat Model",
  "model_catalog_entry_id": "<MODEL_CATALOG_ENTRY_ID>",
  "provider_account_id": "<PROVIDER_ACCOUNT_ID>"
}
```

```bash
agentclash infra model-alias create --from-file model-alias.json
agentclash infra model-alias list
agentclash infra model-alias get <MODEL_ALIAS_ID>
```

Optional workspace tools:

`tool.json`:

```json
{
  "name": "inventory-api",
  "tool_kind": "http",
  "capability_key": "inventory.lookup",
  "definition": {}
}
```

```bash
agentclash infra tool list
agentclash infra tool create --from-file tool.json
agentclash infra tool get <TOOL_ID>
```

Archive or delete only with explicit user confirmation:

```bash
agentclash infra runtime-profile archive <RUNTIME_PROFILE_ID>
agentclash infra model-alias delete <MODEL_ALIAS_ID>
agentclash infra provider-account delete <PROVIDER_ACCOUNT_ID>
agentclash secret delete <SECRET_KEY>
```

## Expected Output
- `secret list` shows secret keys and timestamps, never secret values.
- `model-catalog list` returns global model entries with provider, model, family, and lifecycle status.
- `provider-account list` shows provider key, account name, status, and ID.
- `runtime-profile list` shows execution target, max iterations, and ID.
- `model-alias list` shows alias key, display name, status, and ID.
- `infra tool list` shows workspace tool name, kind, lifecycle status, and ID.
- The final handoff contains the provider account ID, runtime profile ID, model alias ID, relevant secret key names, and optional tool IDs.

## Failure Modes
- `no workspace specified`: run `agentclash link`, pass `--workspace`, set `AGENTCLASH_WORKSPACE`, or add project config with `agentclash init`.
- Provider account creation fails because the secret is missing: run `agentclash secret list`, then set the expected key and use `workspace-secret://KEY`.
- A raw `api_key` was passed and cannot be read back: expected behavior; the infrastructure manager stores it as a workspace secret named `PROVIDER_<PROVIDER_KEY>_API_KEY` and keeps only `workspace-secret://PROVIDER_<PROVIDER_KEY>_API_KEY` on the provider account. The provider key is uppercased and hyphens become underscores, so `x-ai` becomes `PROVIDER_X_AI_API_KEY`.
- Model alias creation fails because the model catalog entry is wrong or unavailable: inspect with `agentclash infra model-catalog get <MODEL_CATALOG_ENTRY_ID>` and choose an active entry for the intended provider.
- Deployment setup later fails because no runtime profile exists: create or select a runtime profile and pass its ID into deployment setup.
- Runs fail because network, shell, timeout, or tool-call limits are too strict: review `profile_config`, `max_iterations`, `max_tool_calls`, `step_timeout_seconds`, and `run_timeout_seconds`.
- Workspace tools are confused with pack-defined tools: workspace tools are `agentclash infra tool ...` resources; pack-defined tools live inside challenge pack YAML.

## Safety Notes
- Never print, paste, request, receive, or commit raw provider keys. Prefer stdin for `secret set`; if the value is not already in the user's shell, ask the user to run the command themselves.
- Prefer `credential_reference: "workspace-secret://KEY"` in provider account specs.
- Treat `delete` and `archive` commands as destructive enough to require explicit user confirmation.
- Use `list` and `get` before `create`, `delete`, or `archive` to avoid duplicating or mutating the wrong workspace resource.
- Keep local/self-hosted API URLs explicit; hosted examples should use `https://api.agentclash.dev`.

## Report Back Format
```text
Workspace: <workspace-id>
Secrets: <KEY present | KEY missing>
Provider account: <id or action needed>
Model catalog entry: <id and provider/model>
Runtime profile: <id or action needed>
Model alias: <id or action needed>
Workspace tools: <ids or none>
Readiness: <ready for deployment setup | blocked>
Next skill: agentclash-agent-build-author | agentclash-agent-deployment-setup
Notes: <credential, limit, sandbox, or tool caveats>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-agent-build-author`
- `agentclash-agent-deployment-setup`
- `agentclash-challenge-pack-tools-sandbox`

## Related Docs
- `/docs-md/guides/configure-runtime-resources`
- `/docs-md/concepts/agents-and-deployments`
- `/docs-md/concepts/tools-network-and-secrets`
- `/docs-md/reference/cli`
- `/docs-md/reference/config`
````

---

# Challenge Pack Artifacts Skill

Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts`

## Use This Skill When

Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-artifacts
description: Use when specifying AgentClash challenge pack assets, artifact references, produced file captures, evidence references, artifact upload/download expectations, and review-only evidence.
metadata:
  agentclash.role: challenge-pack-artifacts
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Artifacts

## Purpose
Make challenge pack files and evidence explicit without confusing three different surfaces:

- source assets declared in challenge-pack YAML
- case artifact references exposed to scoring evidence
- files produced by the agent and captured after execution

Use this skill when a coding agent needs exact artifact fields and evidence references without reading the AgentClash source repo.

## Use When
- A pack includes fixture files, expected reports, images, logs, datasets, or other stored assets.
- Cases need `inputs[].artifact_key`, `expectations[].artifact_key`, `artifact_refs`, or case `artifacts`.
- Scoring needs `artifact...` or `file:...` evidence references.
- Native execution should capture files from the sandbox through `post_execution_checks`.
- A reviewer needs to inspect artifacts after a run.

## Do Not Use When
- The pack only needs plain text final-output scoring; use `agentclash-challenge-pack-scoring-validators`.
- The task is selecting input sets or regression suites; use `agentclash-challenge-pack-input-sets` or `agentclash-regression-flywheel`.
- The task is configuring tools, sandbox network, or package access; use `agentclash-challenge-pack-tools-sandbox`.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` is a hosted API call: authenticate first and select a workspace with `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or local `.agentclash.yaml`. Use `agentclash-cli-setup` if the machine is not already logged in.

## Validation Commands
Validate after changing any asset, artifact reference, input artifact key, expectation artifact key, or file evidence target.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Asset Declaration Shape
Challenge-pack assets use the same field shape at `version.assets`, `challenges[].assets`, and `input_sets[].cases[].assets`.

```yaml
version:
  assets:
    - key: policy_pdf
      kind: document
      path: assets/policy.pdf
      media_type: application/pdf
      artifact_id: 00000000-0000-0000-0000-000000000000

challenges:
  - key: summarize-policy
    title: Summarize Policy
    category: document
    difficulty: medium
    assets:
      - key: challenge_notes
        path: assets/challenge-notes.md

input_sets:
  - key: policy-smoke
    name: Policy Smoke
    cases:
      - challenge_key: summarize-policy
        case_key: summary-basic
        assets:
          - key: case_notes
            path: assets/cases/summary-basic-notes.md
```

Source-backed fields:

- `key`: required and unique inside that one `assets` list.
- `path`: required, even when `artifact_id` is present.
- `kind`: optional string.
- `media_type`: optional string.
- `artifact_id`: optional UUID for an already stored artifact.

The current validator does not enforce enums for `kind` or `media_type`; use clear, conventional values and validate the pack.

## Version Assets Versus Local Assets
Only `version.assets` form the reference pool for artifact references.

These fields must reference a declared `version.assets[].key`:

- `challenges[].artifact_refs[].key`
- `input_sets[].cases[].artifacts[].key`
- `input_sets[].cases[].inputs[].artifact_key`
- `input_sets[].cases[].expectations[].artifact_key`
- `input_sets[].cases[].expectations[].source: artifact:<key>`

Challenge-level and case-level `assets` are validated for `key` and `path`, but they are not accepted as targets for `artifact_refs`, `artifacts`, `artifact_key`, or `source: artifact:...`. If a file needs to be referenced by scoring or case evidence, declare it in `version.assets`.

## Artifact References
Use `artifact_refs` on a challenge to say which version assets matter to that challenge. Use case `artifacts` to expose version assets as artifact evidence for a case.

```yaml
version:
  assets:
    - key: policy_pdf
      path: assets/policy.pdf
      media_type: application/pdf
    - key: expected_summary
      path: assets/expected-summary.json
      media_type: application/json

challenges:
  - key: summarize-policy
    title: Summarize Policy
    category: document
    difficulty: medium
    artifact_refs:
      - key: policy_pdf

input_sets:
  - key: policy-smoke
    name: Policy Smoke
    cases:
      - challenge_key: summarize-policy
        case_key: summary-basic
        artifacts:
          - key: policy_pdf
          - key: expected_summary
        inputs:
          - key: prompt
            kind: text
            value: Summarize the refund policy.
          - key: source_document
            kind: file
            artifact_key: policy_pdf
            path: assets/policy.pdf
        expectations:
          - key: summary_reference
            kind: json
            artifact_key: expected_summary
          - key: summary_reference_via_source
            kind: json
            source: artifact:expected_summary
```

Hard validation rules:

- `artifact_refs[].key` is required, unique in that `artifact_refs` list, and must reference `version.assets`.
- Case `artifacts[].key` is required, unique in that case `artifacts` list, and must reference `version.assets`.
- `inputs[].artifact_key` and `expectations[].artifact_key` must reference `version.assets`.
- `expectations[].source` may be empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.

## Evidence References
Scoring references case and artifact evidence through supported evidence-reference strings.

Artifact evidence:

- Supported prefix shape: `artifact.<path>`.
- Do not read `<path>` as a filesystem path here. The concrete shape is `artifact.<artifact_key>[.<field>]`, where the first segment after `artifact.` is the case artifact key, for example `artifact.expected_summary`.
- Use `artifact.expected_summary.path` for the asset path.
- Use `artifact.expected_summary.media_type`, `artifact.expected_summary.kind`, or `artifact.expected_summary.key` when the metadata is what the validator or judge needs.

Case evidence:

- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`

Produced file evidence:

- `file:<post_execution_check_key>`

Use `literal:<value>` for inline expected values when a validator requires `expected_from`.

## Produced File Capture
Agent-produced files are not declared with pack `assets`. Capture them after native execution through `version.evaluation_spec.post_execution_checks`.

```yaml
version:
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - file
      - build
  evaluation_spec:
    name: policy-artifact-eval
    version_number: 1
    judge_mode: deterministic
    post_execution_checks:
      - key: generated_summary
        type: file_capture
        path: /workspace/summary.json
        max_size_bytes: 1048576
      - key: project_listing
        type: directory_listing
        path: /workspace
        recursive: true
    validators:
      - key: generated_summary_exists
        type: file_exists
        target: file:generated_summary
      - key: generated_summary_matches_schema
        type: file_json_schema
        target: file:generated_summary
        config:
          schema:
            type: object
    scorecard:
      dimensions:
        - key: correctness
          source: validators
          weight: 1
```

Source-backed `post_execution_checks` fields:

- `key`: required and unique.
- `type`: required and must be `file_capture` or `directory_listing`.
- `path`: required.
- `recursive`: optional boolean, useful for `directory_listing`.
- `max_size_bytes`: optional integer; must be greater than or equal to `0`.

File validators must target `file:<post_execution_check_key>`. `code_execution` validators must reference a `file_capture`, not a `directory_listing`.

## Review-Only Evidence Patterns
Use review-only evidence when humans need debugging context but scoring should not depend on it.

Good review-only patterns:

- Add a `directory_listing` check such as `project_listing` so reviewers can see what the agent created.
- Capture small logs with `file_capture` when they explain failures.
- Upload external review material with the artifact CLI and attach metadata that explains the source.
- Report artifact keys, paths, and IDs back to the user without claiming they were scored.

Keep scored evidence and review evidence separate in the report. If a file should affect the score, reference it from a validator, judge `context_from`, or judge `reference_from`. If it is only context for humans, leave it unreferenced by scoring.

## Artifact CLI Surface
Workspace artifacts are managed by the CLI, but challenge-pack YAML does not have upload or download commands inside the bundle.

```bash
agentclash artifact list
agentclash artifact list --json
agentclash artifact upload path/to/file --type reference-data
agentclash artifact upload path/to/file --type reference-data --metadata '{"purpose":"review"}' --json
agentclash artifact download <ARTIFACT_ID> --output path/to/file
```

Exact command notes:

- `agentclash artifact list` lists artifacts in the current workspace. It does not have a `--run` filter today.
- `agentclash artifact upload <file>` requires `--type`.
- Upload also accepts optional `--run <RUN_ID>`, `--run-agent <RUN_AGENT_ID>`, and `--metadata <JSON>`.
- `agentclash artifact download <artifactId>` writes to stdout by default.
- Use `--output` or `-O` to save a downloaded artifact to a file.

Use these commands for workspace artifact inspection or preloading stored artifacts. Use `version.assets[].artifact_id` only when you already have a stored artifact UUID to reference.

## Common Validation Failures
- An asset omits `key` or `path`.
- Duplicate `key` values appear inside one `assets`, `artifact_refs`, or case `artifacts` list.
- `artifact_refs[].key` points at a challenge-local or case-local asset instead of `version.assets`.
- Case `artifacts[].key`, `inputs[].artifact_key`, or `expectations[].artifact_key` points at an undeclared version asset.
- `source: artifact:...` uses an empty key or a key not declared in `version.assets`.
- A scoring validator targets `file:generated_summary` but no matching `post_execution_checks[].key` exists.
- A file validator uses `artifact.expected_summary` instead of `file:<post_execution_check_key>`.
- A `code_execution` validator targets a `directory_listing`.
- The pack uses produced-output assets under `version.assets` instead of `post_execution_checks`.

## Authoring Procedure
1. List every static file the pack needs.
2. Put any file that must be referenced by cases or scoring in `version.assets`.
3. Add challenge-local or case-local `assets` only for local organization; do not rely on them for `artifact_key` references.
4. Add `artifact_refs` to each challenge when a version asset belongs to that challenge.
5. Add case `artifacts` when scoring or review evidence should expose a version asset on that case.
6. Use `inputs[].artifact_key`, `expectations[].artifact_key`, or `source: artifact:<key>` only for declared version assets.
7. For agent-produced files, use `version.evaluation_spec.post_execution_checks`, then target them with `file:<key>`.
8. Decide which captured files are scored and which are review-only.
9. Run `agentclash challenge-pack validate path/to/pack.yaml --json`.
10. Report exact artifact keys, file paths, scored evidence references, review-only evidence, and validation result.

## Safety
- Do not include raw secret values in assets, captured files, artifact metadata, examples, chat, or commits.
- Keep captured files small and deliberate; `file_capture` content becomes scoring evidence.
- Avoid capturing broad directories unless a listing is needed for review.
- Do not store private customer data as long-lived `version.assets` unless retention and access are approved.
- Prefer deterministic fixture assets over live mutable files.

## Report Back Format
```text
Version assets:
- key:
  path:
  media_type:
  artifact_id:
Challenge artifact refs:
Case artifacts:
Input artifact keys:
Expectation artifact keys/sources:
Produced file captures:
- key:
  type:
  path:
  scored: <yes/no>
Scoring evidence refs:
Review-only evidence:
Artifact CLI commands used:
Validation command:
Validation result:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-scorecard-reader`
````

---

# Challenge Pack Input Sets Skill

Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-input-sets`

## Use This Skill When

Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-input-sets
description: Use when designing AgentClash challenge pack cases and input sets for smoke, full benchmark, regression, edge-case, or CI suite-only coverage.
metadata:
  agentclash.role: challenge-pack-inputs
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Input Sets

## Purpose
Design `input_sets` and cases that are valid, repeatable, and useful for runs, regression promotion, and CI gates.

Use this skill after the challenge pack structure is known and before scoring is finalized. The goal is to make every case observable: each case should have a stable key, a declared challenge, concrete inputs, expected evidence, and a clear reason to exist.

## Use When
- A challenge pack needs smoke, full, regression, edge-case, or CI-oriented case subsets.
- Cases exist but are poorly named, duplicated, too broad, or hard to score.
- A coding agent needs exact `input_sets[].cases[]` YAML shape without reading the AgentClash source repo.
- You need to decide which cases are safe for fast checks versus full benchmark runs.

## Do Not Use When
- The pack idea is still vague; use `agentclash-challenge-pack-planner`.
- The user needs the whole YAML file written; use `agentclash-challenge-pack-yaml-author`.
- The task is to configure validators, judges, tools, sandbox, artifacts, validation, publish, or run creation; use the focused downstream skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## Validation Commands
Validate the pack after editing input sets.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Exact YAML Shape
The current bundle parser accepts top-level `input_sets`, each with `key`, `name`, optional `description`, and `cases`.

```yaml
input_sets:
  - key: refund-smoke
    name: Refund Smoke
    description: Fast refund-policy checks for CI and authoring smoke tests.
    cases:
      - challenge_key: refund-question
        case_key: refund-window-basic
        payload:
          customer_message: Can I get a refund after 14 days?
          account_tier: standard
        inputs:
          - key: prompt
            kind: text
            value: Can I get a refund after 14 days?
        expectations:
          - key: policy_reference
            kind: text
            source: input:prompt
```

Case fields:

- `challenge_key`: required, must reference a declared `challenges[].key`.
- `case_key`: required for new YAML, stable, and unique per challenge within the input set.
- `payload`: optional structured data for the case, such as text, JSON-like fields, IDs, or fixture metadata.
- `inputs`: optional list of concrete case inputs.
- `expectations`: optional list of expected evidence or references.
- `artifacts`: optional list of case artifact refs that must reference declared version assets.
- `assets`: optional case-local assets, each with unique `key` and required `path`.

Use `cases`, not legacy `items`, in new YAML. Legacy `items` are normalized by the parser, but new skills and packs should not author them.

## Hard Validator Rules
- Every input set needs `key`, `name`, and at least one case.
- `input_sets[].key` values must be unique.
- Every case needs `challenge_key` and `case_key`.
- Every `challenge_key` must reference a declared challenge.
- All cases inside the same input set must reference the same `challenge_key`.
- A `case_key` must be unique per challenge within that input set.
- Every case input needs unique `key` and required `kind`.
- Every case expectation needs unique `key` and required `kind`.
- `artifact_key` on inputs or expectations must reference a declared version asset.
- Expectation `source` must be empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`, for example `source: input:prompt`.

Because an input set cannot mix challenge keys today, use separate input sets per challenge when designing pack-wide smoke or CI coverage:

```yaml
input_sets:
  - key: refund-smoke
    name: Refund Smoke
    cases:
      - challenge_key: refund-question
        case_key: refund-window-basic
  - key: billing-smoke
    name: Billing Smoke
    cases:
      - challenge_key: billing-question
        case_key: invoice-copy-basic
```

## Input And Expectation Patterns
Use `payload` for case data the evaluator or prompt builder should understand as structured context. Use `inputs` when individual evidence keys need to be referenced by expectations, validators, judges, or review output.

```yaml
cases:
  - challenge_key: summarize-policy
    case_key: policy-summary-edge-exclusions
    payload:
      audience: customer-support-agent
      risk: exclusion missed
    inputs:
      - key: prompt
        kind: text
        value: Summarize the policy and call out exclusions.
      - key: source_doc
        kind: file
        artifact_key: policy_pdf
        path: assets/policy.pdf
    expectations:
      - key: required_topics
        kind: json
        value:
          must_include:
            - refund window
            - exclusions
      - key: prompt_reference
        kind: text
        source: input:prompt
```

Use `value` when the expected content is inline. Use `source: input:<key>` when the expectation should refer to an input in the same case. Use `source: artifact:<key>` only for version-level assets.

## Input Set Types
Use names that describe run intent and challenge scope.

| Input set | Purpose | Typical size | Guidance |
| --- | --- | --- | --- |
| `<challenge>-smoke` | Fast sanity check | 1-3 cases | Covers the most ordinary success path and one cheap edge case. |
| `<challenge>-ci` | CI gate input set | 1-5 cases | Deterministic, stable, low-cost, no flaky external dependency. |
| `<challenge>-full` | Benchmark coverage | 5+ cases | Representative distribution across easy, medium, hard, and expert cases. |
| `<challenge>-regression` | Known failure replay | As needed | Minimal reproductions of real failures with stable expectations. |
| `<challenge>-edge` | Boundary behavior | Focused | Valid unusual inputs, ambiguity, malformed-but-recoverable payloads, or safety guardrails. |

Do not put unrelated capabilities into one input set. If a run needs multiple challenges, model that through multiple challenge-specific input sets or downstream run/eval selection, not a mixed `challenge_key` input set.

## Coverage Review
For each challenge, check:

- Happy path: ordinary user request or fixture.
- Edge path: unusual but valid input.
- Negative path: refusal, abstention, rejection, or safe fallback when appropriate.
- Ambiguous path: should ask for clarification or make a defensible assumption.
- Regression path: known previous failure with the smallest reproducible case.
- Budget path: confirms the case can run within intended time, tool, and cost limits.

Each case should have a reason. If two cases would fail for the same reason and exercise the same evidence, keep the clearer one unless you need variance.

## Stable Key Rules
- Use lowercase kebab-case keys: `refund-window-basic`, `invoice-missing-id`, `policy-summary-edge-exclusions`.
- Do not include dates, random IDs, or run IDs unless they are part of the scenario being tested.
- Keep `case_key` stable after publish; downstream results, regressions, and reports become easier to compare.
- Prefer descriptive input keys such as `prompt`, `source_doc`, `expected_schema`, or `customer_record`.
- Keep fixture IDs inside `payload`, not in the `case_key`, unless the fixture identity is the scenario.

## Regression And CI Guidance
Regression input sets should be small and forensic: they preserve the evidence needed to reproduce a known failure. Use them to seed regression suites later, but do not confuse pack `input_sets` with regression suites.

CI-oriented input sets should be deterministic. Avoid:
- live third-party data that changes without fixture control
- broad network dependency
- subjective-only expectations with no stable evidence
- large file sets when a smaller fixture proves the behavior

The eval runner can select a published input set with:

```bash
agentclash eval start --input-set <INPUT_SET_ID_OR_KEY_OR_EXACT_NAME>
```

`--scope suite_only` is for regression suite/case selection, not a replacement for `input_sets`.

## Common Validation Failures
- One input set mixes `challenge_key` values.
- The case has neither `case_key` nor legacy `item_key`.
- Duplicate `input_sets[].key`, duplicate case keys, duplicate input keys, or duplicate expectation keys.
- `challenge_key` points at a title instead of the declared challenge `key`.
- `source: input:...` references an input key that does not exist in the same case.
- `source: artifact:...` or `artifact_key` references an undeclared version asset.
- Case-local assets omit `path`.
- The case has expectations that are impossible to observe in final output, files, artifacts, or judge context.

## Authoring Procedure
1. List challenges and confirm their exact `key` values.
2. Draft cases per challenge, not across challenges.
3. Split by run intent: smoke, CI, full, regression, and edge.
4. Give every case a stable `case_key`, concrete `payload`, and clear reason.
5. Add `inputs` when expectations or scoring need named evidence.
6. Add `expectations` that reference inline `value`, `input:<key>`, or `artifact:<key>` only when those references exist.
7. Review duplicates and remove cases with no unique signal.
8. Validate the pack with `agentclash challenge-pack validate ... --json`.
9. Hand off to scoring, validation/publish, or eval runner skills.

## Report Back Format
```text
Challenge:
Input sets:
- key:
  purpose:
  case count:
  intended use: <smoke | ci | full | regression | edge>
  cases:
    - case_key:
      payload summary:
      inputs:
      expectations:
      reason:
Coverage gaps:
Validation command:
Ready for scoring: <yes/no>
Next skill:
```

## Related Skills
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-regression-flywheel`
````

---

# Challenge Pack LLM Judges Skill

Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges`

## Use This Skill When

Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-llm-judges
description: Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
metadata:
  agentclash.role: challenge-pack-judging
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack LLM Judges

## Purpose
Add LLM-as-judge scoring only where deterministic validators cannot capture the whole evaluation. Judges should complement objective checks, not replace them.

Use this skill after deterministic validators and evidence sources are known. A good judge is narrow, evidence-bound, budget-aware, and wired to one scorecard dimension through `source: llm_judge`.

## Use When
- Quality depends on reasoning, helpfulness, style, relevance, faithfulness, or nuanced task completion.
- A deterministic validator would be brittle or incomplete.
- A pack needs rubric, assertion, reference, or n-wise cross-agent ranking.
- The scorecard needs judge rationale, confidence, variance, sample count, and model count in replay/scorecard evidence.

## Do Not Use When
- The behavior can be scored with deterministic validators.
- Evidence sources are not stable yet; use input-sets, artifacts, and scoring validators first.
- The run cannot afford extra model calls.
- The judge would need secrets in prompt text or private data that should not leave the workspace.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` calls the hosted API and requires auth plus a workspace. Use `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml` before validating.

## Validation Commands
Validate after changing `judge_mode`, `llm_judges`, judge evidence, scorecard judge dimensions, consensus, or judge limits.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Minimal Hybrid Shape
Current evaluation specs still require at least one deterministic validator. Use `judge_mode: hybrid` when judges are paired with validators.

```yaml
version:
  evaluation_spec:
    name: support-quality
    version_number: 1
    judge_mode: hybrid
    validators:
      - key: mentions_policy
        type: contains
        target: final_output
        expected_from: literal:refund policy
    llm_judges:
      - mode: rubric
        key: helpfulness
        model: gpt-4o
        samples: 3
        context_from:
          - challenge_input
          - final_output
        rubric: |
          Score 1-5 for whether the answer is correct, complete, and easy for a support agent to use.
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          validators:
            - mentions_policy
          weight: 0.5
        - key: helpfulness
          source: llm_judge
          judge_key: helpfulness
          weight: 0.5
```

Mode coherence rules:

- `judge_mode: deterministic`: `llm_judges` must be empty.
- `judge_mode: llm_judge`: `llm_judges` must contain at least one judge.
- `judge_mode: hybrid`: `llm_judges` must contain at least one judge, and validators are still required by the current scoring validator.

## Judge Fields
Every `llm_judges[]` entry uses this source-backed shape:

```yaml
llm_judges:
  - mode: rubric
    key: stable_judge_key
    model: gpt-4o
    context_from:
      - final_output
    rubric: Rate the output 1-5.
```

Fields:

- `mode`: required; one of `rubric`, `assertion`, `reference`, or `n_wise`.
- `key`: required, unique, and must not collide with validator or metric keys.
- `model`: one model identifier.
- `models`: list of model identifiers. Set exactly one of `model` or `models`.
- `samples`: optional integer from `0` to `10`; `0` means default samples, currently `3`.
- `context_from`: optional list of supported evidence references.
- `output_schema`: optional JSON Schema draft-07 or 2020-12. Validation checks the schema; current judge message builders still instruct the built-in JSON response shape.
- `score_scale`: optional `{min, max}` for `rubric` and `reference`; `min` must be strictly less than `max`, default is `1..5`.
- `rubric`: required for `rubric` and `reference`.
- `assertion`: required for `assertion`.
- `expect`: optional boolean for `assertion`; when false, it flips the pass polarity.
- `prompt`: required for `n_wise`.
- `position_debiasing`: optional boolean for `n_wise`; rotates candidate order across samples.
- `reference_from`: required for `reference`; must be a supported evidence reference.
- `consensus`: required when `models` has more than one entry; invalid otherwise.
- `anti_gaming_clauses`: optional extra prompt clauses appended after the built-in judge instructions.
- `timeout_ms`: optional integer greater than `0`; default judge timeout is 60 seconds.

## Supported Evidence
Judge `context_from` and `reference_from` entries must use supported evidence references:

- `final_output`
- `run.final_output`
- `challenge_input`
- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`
- `artifact.<artifact_key>[.<field>]`
- `file:<post_execution_check_key>`
- `literal:<value>`

Each `context_from` entry is injected into the judge prompt as `<reference>:\n<resolved value>`. Reference judges also inject `reference_answer` from `reference_from`.

If any required judge context or reference evidence is unavailable, the judge result becomes unavailable with a `reason`; it does not produce a `normalized_score`.

## Rubric Mode
Use `mode: rubric` for subjective numeric scoring against a written rubric.

```yaml
llm_judges:
  - mode: rubric
    key: persuasiveness
    model: claude-sonnet-4-6
    samples: 3
    context_from:
      - challenge_input
      - final_output
    score_scale:
      min: 1
      max: 5
    rubric: |
      Score 1-5.
      1: Incorrect, incomplete, or unsupported.
      3: Mostly correct but misses important nuance.
      5: Correct, concise, evidence-bound, and directly useful.
scorecard:
  dimensions:
    - key: persuasiveness
      source: llm_judge
      judge_key: persuasiveness
```

The judge is instructed to return JSON shaped like:

```json
{"score": 4, "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

Rubric and reference scores are clamped to the configured `score_scale` and normalized to `0..1` for the scorecard dimension.

## Assertion Mode
Use `mode: assertion` for yes/no claims such as safety, groundedness, or policy compliance.

```yaml
llm_judges:
  - mode: assertion
    key: no_hallucination
    model: claude-haiku-4-5-20251001
    context_from:
      - case.payload.source_excerpt
      - final_output
    assertion: The response contains only information supported by the source excerpt.
    expect: true
scorecard:
  strategy: hybrid
  dimensions:
    - key: no_hallucination
      source: llm_judge
      judge_key: no_hallucination
      gate: true
      pass_threshold: 1.0
```

The judge is instructed to return JSON shaped like:

```json
{"pass": true, "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

The parser also accepts `verdict` values such as `pass`, `true`, `yes`, `fail`, `false`, or `no`. Assertion samples aggregate by majority and normalize to `1` or `0`.

## Reference Mode
Use `mode: reference` when a gold answer or expected artifact exists.

```yaml
llm_judges:
  - mode: reference
    key: summary_quality
    model: gpt-4o
    context_from:
      - final_output
    reference_from: case.expectations.reference_summary
    rubric: |
      Compare the response to the reference summary for coverage, faithfulness, and concision.
      Penalize unsupported additions.
scorecard:
  dimensions:
    - key: summary_quality
      source: llm_judge
      judge_key: summary_quality
```

`reference_from` must resolve to available evidence. If it does not, the judge is unavailable.

## N-Wise Mode
Use `mode: n_wise` to rank all run agents in the same run against one another.

```yaml
llm_judges:
  - mode: n_wise
    key: overall_quality
    model: claude-sonnet-4-6
    samples: 3
    position_debiasing: true
    context_from:
      - final_output
    prompt: Rank the candidate outputs from best to worst on correctness, completeness, and clarity.
scorecard:
  dimensions:
    - key: overall
      source: llm_judge
      judge_key: overall_quality
```

The judge is instructed to return JSON shaped like:

```json
{"ranking": ["<run_agent_id>", "..."], "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

The parser also accepts `ranked_ids`. Every candidate must appear exactly once. `n_wise` requires at least two run agents in the run. The current run agent receives a normalized Borda-style score from its rank.

## Multi-Model Consensus
Use `models` only when you need cross-model agreement. Multiple models require `consensus`.

```yaml
llm_judges:
  - mode: rubric
    key: quality
    models:
      - claude-sonnet-4-6
      - gpt-4o
    consensus:
      aggregation: median
      min_agreement_threshold: 0.6
      flag_on_disagreement: true
    rubric: Rate overall quality 1-5.
```

Consensus rules:

- `aggregation`: `median`, `mean`, `majority_vote`, or `unanimous`.
- `median` and `mean` are valid only for numeric modes: `rubric`, `reference`, and `n_wise`.
- `majority_vote` is valid only for `assertion`.
- `unanimous` is valid for numeric and assertion modes.
- `min_agreement_threshold` must be between `0` and `1`.
- `flag_on_disagreement` is optional boolean.

Single-model judges must not include `consensus`.

## Judge Limits
Judge limits live under `scorecard.judge_limits`.

```yaml
scorecard:
  judge_limits:
    max_samples_per_judge: 3
    max_calls_usd: 2.50
    max_tokens: 50000
```

Validation rules:

- `max_samples_per_judge`: `0..10`; `0` means use each judge's own `samples` defaulting behavior.
- `max_calls_usd`: greater than or equal to `0`.
- `max_tokens`: greater than or equal to `0`.

Current pack validation range-checks these budget knobs. Still keep each judge's `samples` and `models` small: every judge times every sample times every model can create one LLM call.

## Scorecard Wiring
Each judge-backed dimension must point at exactly one judge key.

```yaml
scorecard:
  strategy: hybrid
  dimensions:
    - key: deterministic_correctness
      source: validators
      validators:
        - mentions_policy
      weight: 0.5
    - key: groundedness
      source: llm_judge
      judge_key: no_hallucination
      better_direction: higher
      gate: true
      pass_threshold: 1.0
    - key: helpfulness
      source: llm_judge
      judge_key: helpfulness
      weight: 0.5
```

Rules:

- `source: llm_judge` requires `judge_key`.
- `judge_key` must reference an existing `llm_judges[].key`.
- `judge_key` must be empty for non-`llm_judge` dimensions.
- `better_direction`, when present for `llm_judge`, must be `higher`.
- All current judge modes produce numeric normalized scores that can feed dimensions.
- `strategy: hybrid` requires at least one gated dimension.

## Abstention And Unavailable Results
There is no `abstention_rule`, `abstain`, or `unable_to_judge` field in the YAML authoring surface.

Judges become unavailable when:

- `context_from` evidence cannot resolve.
- `reference_from` evidence cannot resolve.
- an `n_wise` run has fewer than two run agents.
- all model calls fail or return unparsable output.
- no provider client or credential path is configured for the judge model.

An unavailable judge result has no `normalized_score`; it keeps a `reason` and payload details such as failed calls or `unable_to_judge_count`. A `source: llm_judge` dimension with no normalized score becomes unavailable.

## Result Shape
Judge-backed scorecards and replay surfaces can include `llm_judge_results` with:

- `judge_key`
- `mode`
- `normalized_score`
- `payload`
- `confidence`
- `variance`
- `sample_count`
- `model_count`
- `reason`

`payload` includes call records, model scores, aggregated score, warnings, and n-wise candidates when applicable.

## Security And Anti-Gaming
- Do not put raw secrets in `rubric`, `assertion`, `prompt`, `anti_gaming_clauses`, `context_from`, `reference_from`, or `literal:` values.
- Validation rejects `${secrets.*}` references in `rubric`, `assertion`, and `prompt`; avoid them everywhere else too.
- `anti_gaming_clauses` are additive. The evaluator still injects built-in judge instructions and default anti-gaming behavior.
- Keep evidence narrow. Judges receive resolved context text in provider requests.
- Prefer deterministic validators for hard safety, schema, file, and policy constraints; judges should score nuance.

## Common Validation Failures
- `judge_mode: deterministic` includes `llm_judges`.
- `judge_mode: llm_judge` or `hybrid` has no `llm_judges`.
- A judge `key` is empty, duplicated, or collides with a validator/metric key.
- `mode` is not `rubric`, `assertion`, `reference`, or `n_wise`.
- `rubric` mode omits `rubric`.
- `reference` mode omits `rubric` or `reference_from`.
- `assertion` mode omits `assertion`.
- `n_wise` mode omits `prompt`.
- Both `model` and `models` are set, or neither is set.
- `samples` is negative or greater than `10`.
- `models` has more than one model but no `consensus`.
- `consensus` appears on a single-model judge.
- `context_from` or `reference_from` is not a supported evidence reference.
- `score_scale.min >= score_scale.max`.
- `timeout_ms <= 0`.
- `scorecard.judge_limits` values are out of range.
- `source: llm_judge` dimension omits `judge_key` or references an unknown judge.

## Authoring Procedure
1. Keep deterministic validators for hard checks.
2. Choose the judge mode: `rubric`, `assertion`, `reference`, or `n_wise`.
3. Pick the smallest useful `context_from` evidence set.
4. Write rubric/assertion/prompt text that tells the judge how to use only that evidence.
5. Select one `model`, or use `models` plus `consensus` only when cross-model agreement matters.
6. Keep `samples` low; use `scorecard.judge_limits` for budget guardrails.
7. Wire each judge to one `source: llm_judge` scorecard dimension with `judge_key`.
8. Pair hard gates with deterministic validators or assertion judges.
9. Validate with `agentclash challenge-pack validate path/to/pack.yaml --json`.
10. Report judge keys, modes, evidence refs, model fan-out, budget settings, and any unavailable-evidence risks.

## Report Back Format
```text
Judge mode:
Judges:
- key:
  mode:
  model/models:
  samples:
  context_from:
  reference_from:
  consensus:
  score_scale:
  anti_gaming_clauses:
Scorecard dimensions:
Validator pairings:
Judge limits:
Unavailable-evidence risks:
Security review:
Validation command:
Validation result:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
````

---

# Challenge Pack Planner Skill

Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-planner`

## Use This Skill When

Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-planner
description: Use when turning a vague AgentClash evaluation idea into a source-backed challenge pack plan with task boundaries, target agents, cases, input sets, scoring strategy, tools, artifacts, runtime policy, validation criteria, and handoff steps.
metadata:
  agentclash.role: challenge-pack-planning
  agentclash.version: "1"
  agentclash.requires_cli: "false"
---

# AgentClash Challenge Pack Planner

## Purpose
Turn an eval idea into a concrete challenge-pack plan before anyone writes YAML.

Use this skill to produce the planning artifact that downstream skills can convert into an AgentClash challenge pack. Planning does not require CLI access, but the plan must match the current challenge-pack model: `pack`, `version`, optional top-level `tools`, `challenges`, and `input_sets`.

## Use When
- A user describes a benchmark, regression idea, eval suite, or "can we test this agent?" scenario.
- The workload needs boundaries before YAML authoring: what behavior is tested, what cases exist, what counts as success, and which tools/files are allowed.
- A coding agent needs enough AgentClash product context to plan a pack without reading the AgentClash source repo.
- You need a handoff to `agentclash-challenge-pack-yaml-author`, `agentclash-challenge-pack-input-sets`, `agentclash-challenge-pack-scoring-validators`, or `agentclash-challenge-pack-llm-judges`.

## Do Not Use When
- The user already has a finished plan and wants valid YAML; use `agentclash-challenge-pack-yaml-author`.
- The task is only to validate or publish a YAML file; use `agentclash-challenge-pack-validation-publish`.
- The task is to choose deployments or start runs; use `agentclash-agent-deployment-setup` or `agentclash-eval-runner`.
- The user needs CLI installation, auth, workspace linking, or hosted setup; use `agentclash-cli-setup`.

## Inputs Needed
- Evaluation goal: the behavior, capability, or failure mode the pack should expose.
- Target agent class: coding agent, support bot, research agent, workflow agent, extraction agent, etc.
- Good, bad, and borderline outputs.
- Expected evidence: final text, JSON fields, captured files/directories, artifacts, metrics, latency, cost, or judge rationale.
- Case inventory: representative, edge, adversarial, regression, and smoke examples.
- Execution needs: `prompt_eval` versus `native`, tools, network, files, packages, secrets, and time budget.
- Release intent: exploration, regression suite, CI gate, public comparison, or customer demo.

## Planning Procedure
1. State the pack boundary in one sentence: what is being tested and what is explicitly out of scope.
2. Choose execution mode:
   - `prompt_eval` for prompt-style tasks that do not need pack-defined tools, sandbox config, or native file/tool execution.
   - `native` when the agent must use files, tools, sandbox policy, network, packages, artifacts, or code/file validators.
3. Define one or more `challenges`. Each challenge should have a stable `key`, title, category, difficulty (`easy`, `medium`, `hard`, or `expert`), and instructions.
4. Design cases before scoring. For each case, define a stable `case_key`, the `challenge_key` it targets, concrete inputs, expected outputs or expectations, and why the case exists.
5. Group cases into `input_sets`: at minimum `smoke` or `default`; add `full`, `regression`, or `ci` only when their purpose and budget differ. Each input set must contain cases for a single `challenge_key`; split mixed-challenge suites into separate input sets.
6. Pick evidence sources. Decide whether success is visible in `final_output`, structured JSON, files, artifacts, tool behavior, metrics, or LLM-judge rationale.
7. Choose scoring:
   - deterministic validators for exact, regex, JSON, numeric, token overlap, math, file, directory, or code-execution checks.
   - LLM judges for subjective quality where deterministic checks cannot honestly capture the behavior.
   - hybrid scoring when hard gates and qualitative judgment both matter.
8. Decide runtime policy only if needed: allowed tool kinds, sandbox network access, package needs, file assets, and secrets. Keep the policy as narrow as the workload allows.
9. Define publish criteria: what must be true before the pack can be validated, published, and used in an eval run.
10. Produce a handoff plan naming the next skill and the missing information, if any.

## Challenge Pack Model
Use these product nouns consistently:

- `pack`: human metadata such as `slug`, `name`, `family`, and optional description.
- `version`: executable version data: `number`, `execution_mode`, `evaluation_spec`, and optional `tool_policy`, `filesystem`, `sandbox`, and `assets`.
- `tools`: optional top-level pack-defined composed tools. Do not plan these for `prompt_eval`.
- `challenges`: task definitions. Cases reference them by `challenge_key`.
- `input_sets`: named groups of runnable cases.
- `cases`: concrete workload items with `case_key`, `payload`, `inputs`, `expectations`, case `artifacts`, and case-local `assets` as needed.
- `evaluation_spec`: score contract with `judge_mode`, validators, optional metrics or LLM judges, runtime limits, pricing, and scorecard dimensions.

## Planning Heuristics
- Prefer fewer, sharper challenges over a broad pack that mixes unrelated behaviors.
- Prefer small smoke sets that fail fast and full sets that measure coverage.
- Each case should teach the evaluator something unique. Duplicate cases need a reason, such as variance or regression coverage.
- Make expectations observable. "Looks good" is not enough; specify the evidence path and what a pass means.
- Use deterministic validators for hard facts, schemas, files, and code behavior. Use LLM judges for judgement calls like helpfulness, prioritization, style, tradeoff quality, or incident reasoning.
- Use `native` only when the task truly needs sandbox/files/tools. Simpler packs are easier to validate and reuse.
- Do not put raw secrets in the plan. Name the required secret keys and say they must be provided through workspace secrets or runtime/provider configuration.
- Do not invent IDs. Planning should name resources by role until validation/publish creates real IDs.

## Execution Mode Decision Table
| Need | Plan |
| --- | --- |
| Single prompt and final text answer | `prompt_eval` |
| Structured extraction from text input | `prompt_eval` unless file tooling is required |
| Agent must read/write files, run tests, or produce artifacts | `native` |
| Pack-defined custom tools | `native` |
| Network access or extra packages | `native` with explicit sandbox policy |
| Code, file, directory, or artifact-backed validators | `native` |
| Pure qualitative grading | `prompt_eval` or `native`, plus `llm_judge` depending on execution needs |

## Scoring Plan Shape
For each scoring dimension, specify:

```text
Dimension: <stable key>
Source: validators | metric | reliability | latency | cost | behavioral | llm_judge
Evidence: <final_output | file:path | artifact key | metric collector | judge key>
Pass rule: <threshold, gate, or qualitative rubric>
Failure message: <what should be reported when this fails>
```

Use `judge_mode: deterministic` when validators and metrics are sufficient. Use `judge_mode: llm_judge` when judges are the main grading surface. Use `judge_mode: hybrid` when deterministic gates and LLM-judge dimensions both matter.

## Case Coverage Checklist
- Happy path: the most ordinary success case.
- Edge case: unusual but valid input.
- Negative or guardrail case: input that should be rejected, abstained from, or handled safely.
- Ambiguity case: forces prioritization or asks for clarification when appropriate.
- Regression case: a known previous failure, with the evidence that should prevent recurrence.
- Budget case: confirms the pack can run within intended time, tool, and cost limits.

## Tool, Sandbox, And Artifact Planning
Only include these when the workload needs them.

- Allowed tool kinds in `version.tool_policy.allowed_tool_kinds` must use supported broad kinds such as `browser`, `build`, `data`, `file`, and `network`.
- `version.sandbox.network_access` should stay false unless the task needs outbound network.
- `version.sandbox.network_allowlist` should be specific when network is needed.
- `version.sandbox.additional_packages` should name only packages required by the workload or validators.
- Version, challenge, and case assets should have stable `key` and `path`; artifact-backed assets also need an `artifact_id` after upload.
- Case expectations can use `value`, `artifact_key`, or `source`. Supported `source` values are empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.

## Output Format
```text
Pack name:
Slug/family:
Goal:
Out of scope:
Target agent:
Execution mode: <prompt_eval | native>

Challenges:
- key:
  title:
  category:
  difficulty:
  instructions summary:

Input sets:
- key:
  purpose:
  cases:
    - case_key:
      challenge_key:
      inputs:
      expectations:
      reason:

Scoring:
- dimension:
  judge mode:
  validators:
  llm judges:
  gates/thresholds:
  evidence:

Runtime policy:
Tools:
Assets/artifacts:
Secrets:
Publish criteria:
Risks/blockers:
Next skill:
```

## Failure Modes
- The plan has no concrete cases: ask for examples or create explicit draft cases from the user's scenario.
- Cases are not tied to a `challenge_key`: add the missing challenge structure before YAML authoring.
- The plan says `prompt_eval` but needs tools, sandbox, files, or network: switch to `native`.
- The scoring is subjective but only uses exact validators: add an LLM judge or rewrite the expected output into deterministic evidence.
- The scoring is objective but only uses LLM judges: replace with validators where possible.
- Input sets mix smoke, regression, and full benchmark cases without purpose: split them by run intent.
- The plan depends on secrets or private data: name secret keys and artifact roles, not raw values.

## Report Back Format
```text
Planned pack: <name>
Execution mode: <prompt_eval | native>
Challenge count: <n>
Case count: <n>
Input sets: <keys>
Scoring mode: <deterministic | llm_judge | hybrid>
Needs tools/sandbox: <yes/no + why>
Needs assets/artifacts: <yes/no + what>
Needs secrets: <yes/no + names only>
Ready for YAML authoring: <yes/no>
Next skill: <agentclash-challenge-pack-yaml-author | other>
Open questions: <blocking details>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`

## Related Docs
- `/docs-md/concepts/challenge-packs-and-inputs`
- `/docs-md/guides/write-a-challenge-pack`
- `/docs-md/concepts/tools-network-and-secrets`
- `/docs-md/concepts/artifacts`
- `/docs-md/reference/cli`
````

---

# Challenge Pack Scoring Validators Skill

Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators`

## Use This Skill When

Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-scoring-validators
description: Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
metadata:
  agentclash.role: challenge-pack-scoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Scoring Validators

## Purpose
Design deterministic scoring that is valid, explainable, and stable enough for CI, regression, and benchmark comparisons.

Use deterministic validators when objective evidence can prove the behavior. Reach for LLM judges only when the output truly needs subjective or rubric-based assessment.

## Use When
- A pack needs `version.evaluation_spec.validators`.
- Scoring can use exact text, JSON, numeric, math, file, directory, or code-execution evidence.
- A scorecard dimension should average one or more validator results.
- A pack needs numeric run metrics such as latency, token count, tool calls, cost, or validator pass rate.
- A reviewer needs to understand why a validator passed, failed, errored, or was unavailable.

## Do Not Use When
- The challenge, cases, or artifacts are still undefined; use the planner, input-sets, and artifacts skills first.
- The evaluation needs rubric, assertion, n_wise, or reference judging; use `agentclash-challenge-pack-llm-judges`.
- The task is publishing or running an already authored pack; use validation/publish or eval-runner skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` calls the hosted API and requires auth plus a workspace. Use `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml` before validating.

## Validation Commands
Validate after changing validators, metrics, dimensions, strategies, file captures, or evidence references.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Evaluation Spec Shape
Deterministic scoring lives under `version.evaluation_spec`.

```yaml
version:
  evaluation_spec:
    name: support-answer-scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_window
        type: contains
        target: final_output
        expected_from: literal:30 days
    metrics:
      - key: latency_ms
        type: numeric
        collector: run_total_latency_ms
        unit: ms
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          validators:
            - mentions_refund_window
          weight: 1
```

Required fields:

- `name`: required non-empty string.
- `version_number`: required integer greater than `0`.
- `judge_mode`: `deterministic`, `llm_judge`, or `hybrid`; use `deterministic` for this skill.
- `validators`: required and must contain at least one validator.
- `scorecard.dimensions`: required and must contain at least one dimension.

Optional scoring sections used by this skill:

- `metrics`: run metric declarations.
- `post_execution_checks`: file or directory capture declarations used by file validators.
- `runtime_limits`, `pricing`, and `behavioral` exist, but use focused skills unless they are directly needed for scoring.

## Validator Fields
Every validator has this source-backed shape:

```yaml
validators:
  - key: stable_validator_key
    type: exact_match
    target: final_output
    expected_from: literal:approved
    config: {}
```

Fields:

- `key`: required, trimmed, and unique.
- `type`: required and must be one of the supported validator types below.
- `target`: required supported evidence reference.
- `expected_from`: required for most validators; omitted for `file_exists`, `file_json_schema`, `directory_structure`, and `code_execution`.
- `config`: optional JSON/YAML object interpreted by the validator type.

There is no validator-level `failure_message`, `pass_message`, or custom result text field. Results emit `state`, `verdict`, `normalized_score`, `reason`, `raw_output`, `target`, `expected_from`, `actual_value`, and `expected_value`; the reason text is produced by the scorer.

## Supported Validator Types
These are the exact validator type strings accepted by the scoring model:

```text
exact_match
contains
regex_match
json_schema
json_path_match
boolean_assert
fuzzy_match
numeric_match
normalized_match
token_f1
math_equivalence
bleu_score
rouge_score
chrf_score
file_content_match
file_exists
file_json_schema
directory_structure
code_execution
```

Do not use `has_json`, `json_equals`, `semantic_match`, `unit_test`, `shell`, or provider-specific names; the validator rejects unknown `type` values.

## Evidence References
Validator `target` and required `expected_from` values must use supported evidence references:

- `final_output`
- `run.final_output`
- `challenge_input`
- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`
- `artifact.<artifact_key>[.<field>]`
- `file:<post_execution_check_key>`
- `literal:<value>`

Use `literal:` for inline expected values. Use `case.expectations.<key>` or `artifact.<artifact_key>.path` when the expected value should come from case evidence rather than the skill text.

## Common Text And JSON Validators
Use these when final output or case evidence is already text or JSON.

```yaml
validators:
  - key: exact_decision
    type: exact_match
    target: case.payload.expected_decision
    expected_from: literal:approve

  - key: contains_policy_term
    type: contains
    target: final_output
    expected_from: literal:refund window

  - key: matches_ticket_pattern
    type: regex_match
    target: final_output
    expected_from: literal:TICKET-[0-9]+

  - key: response_is_schema_valid
    type: json_schema
    target: final_output
    expected_from: 'literal:{"type":"object","required":["decision"],"properties":{"decision":{"type":"string"}}}'

  - key: decision_is_approved
    type: json_path_match
    target: final_output
    expected_from: 'literal:{"path":"$.decision","comparator":"equals","value":"approve"}'

  - key: escalation_flag
    type: boolean_assert
    target: case.payload.should_escalate
    expected_from: literal:true
```

`json_path_match` expected values are either a JSON object with `path`, optional `comparator`, and optional `value`, or a path string that starts with `$` for an existence check. Supported comparators are `equals`, `contains`, `greater_than`, `less_than`, and `exists`.

## Similarity, Numeric, And Math Validators
These validators accept typed `config` fields.

```yaml
validators:
  - key: answer_fuzzy
    type: fuzzy_match
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.85
      case_insensitive: true
      normalize: true

  - key: total_matches
    type: numeric_match
    target: case.payload.agent_total
    expected_from: case.expectations.expected_total
    config:
      absolute_tolerance: 0.01
      extract_number: true

  - key: normalized_phrase
    type: normalized_match
    target: final_output
    expected_from: literal:refund window is 30 days
    config:
      pipeline:
        - trim
        - lowercase
        - collapse_whitespace

  - key: token_overlap
    type: token_f1
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.75
      normalize: true
      remove_articles: true
      remove_punctuation: true

  - key: formula_equivalent
    type: math_equivalence
    target: final_output
    expected_from: literal:x^2 + 2*x + 1
    config:
      comparison_mode: symbolic
```

Source-backed config notes:

- `fuzzy_match.threshold` and `token_f1.threshold` must be between `0` and `1` when set.
- `numeric_match` accepts `absolute_tolerance`, `relative_tolerance`, `extract_number`, `significant_digits`, `tolerance_mode`, and `tolerance`; tolerances must be non-negative, and `significant_digits` must be greater than `0` when set.
- `normalized_match.pipeline` accepts `trim`, `lowercase`, `collapse_whitespace`, `strip_punctuation`, `strip_currency`, `strip_formatting`, `normalize_unicode`, `remove_articles`, `sort_words`, and `sort_lines`.
- `math_equivalence.comparison_mode` must be `symbolic` or `numeric`; `tolerance` must be non-negative.

## Generation-Style Validators
Use these for text similarity against references when exact wording is not required.

```yaml
validators:
  - key: bleu_reference_overlap
    type: bleu_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.4
      max_ngram: 4
      smoothing: method1

  - key: rouge_summary_overlap
    type: rouge_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      variant: rouge-l

  - key: chrf_summary_overlap
    type: chrf_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      char_order: 6
```

Config validation:

- `bleu_score.smoothing` must be `none` or `method1`; `max_ngram` must be greater than `0`.
- `rouge_score.variant` must be `rouge-1`, `rouge-2`, or `rouge-l`; `beta` must be greater than `0` when set.
- `chrf_score.char_order` and `chrf_score.beta` must be greater than `0` when set.

## File And Directory Validators
File validators must use `target: file:<post_execution_check_key>`. Declare the capture first with `version.evaluation_spec.post_execution_checks`.

```yaml
version:
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - file
      - build
  evaluation_spec:
    name: file-scoring
    version_number: 1
    judge_mode: deterministic
    post_execution_checks:
      - key: generated_summary
        type: file_capture
        path: /workspace/summary.json
      - key: project_listing
        type: directory_listing
        path: /workspace
        recursive: true
    validators:
      - key: summary_exists
        type: file_exists
        target: file:generated_summary
      - key: summary_matches_schema
        type: file_json_schema
        target: file:generated_summary
        config:
          schema:
            type: object
            required:
              - decision
      - key: no_secret_file
        type: directory_structure
        target: file:project_listing
        config:
          forbidden_files:
            - .env
      - key: summary_mentions_decision
        type: file_content_match
        target: file:generated_summary
        expected_from: literal:decision
        config:
          match_mode: contains
```

File validator rules:

- `file_content_match` requires `expected_from` and supports `match_mode`: `exact`, `contains`, `regex`, `not_contains`, or `json_equal`; default is `contains`.
- `file_exists` defaults to `must_exist: true`; set `config.must_exist: false` when the file must be absent.
- `file_json_schema` requires `config.schema`.
- `directory_structure` requires config and supports `required_files`, `forbidden_files`, and `required_directories`.
- If any validator target starts with `file:` and checks are declared, the key must match a `post_execution_checks[].key`.

## Code Execution Validator
`code_execution` is a file validator. Its `target` must reference a `file_capture`, not a `directory_listing`, and `config.test_command` is required.

```yaml
post_execution_checks:
  - key: generated_code
    type: file_capture
    path: /workspace/app.py
validators:
  - key: generated_code_tests
    type: code_execution
    target: file:generated_code
    config:
      test_command: python -m pytest tests/ -q
      timeout_ms: 30000
      scoring: fraction_passed
      pass_threshold: 0.8
```

Source-backed config:

- `test_command`: required non-empty string.
- `timeout_ms`: optional integer greater than `0`.
- `scoring`: `fraction_passed` or `all_or_nothing`; `pass_at_k` is defined but currently rejected.
- `pass_threshold`: optional number between `0` and `1`; default effective threshold is `1.0`.

## Metrics
Metrics have `key`, `type`, `collector`, and optional `unit`.

```yaml
metrics:
  - key: latency_ms
    type: numeric
    collector: run_total_latency_ms
    unit: ms
  - key: validator_rate
    type: numeric
    collector: validator_pass_rate
```

Metric `type` must be `numeric`, `text`, or `boolean`. The schema accepts `text`, but the current implemented collectors produce numeric or boolean values. The scorer currently implements these collectors:

```text
run_total_latency_ms
run_ttft_ms
run_input_tokens
run_output_tokens
run_total_tokens
run_tool_call_count
run_agent_tokens
run_race_context_tokens
run_model_cost_usd
run_completed_successfully
run_failure_count
behavioral_recovery_score
behavioral_exploration_efficiency_score
behavioral_error_cascade_score
behavioral_scope_adherence_score
behavioral_confidence_calibration_score
validator_pass_rate
```

Validation rejects `behavioral_confidence_calibration_score` for metrics until confidence reporting lands, even though the engine has a collector branch. Avoid it in new packs.

## Scorecard Dimensions
Use object-form dimensions for source-fidelity and explicit routing.

```yaml
scorecard:
  strategy: weighted
  pass_threshold: 0.8
  dimensions:
    - key: correctness
      source: validators
      validators:
        - mentions_refund_window
        - decision_is_approved
      weight: 0.8
      gate: true
      pass_threshold: 0.9
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      weight: 0.2
```

Dimension fields:

- `key`: required and unique.
- `source`: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, or `llm_judge`.
- `validators`: optional list of validator keys when `source: validators`; omitted means average all validators.
- `metric`: required when `source: metric` and must reference `metrics[].key`.
- `better_direction`: required for `metric`, `latency`, and `cost`; must be `higher` or `lower`.
- `normalization.target` and `normalization.max`: required for `metric`, `latency`, and `cost`.
- `weight`: optional and must be greater than or equal to `0`.
- `gate`: optional boolean.
- `pass_threshold`: required when `gate: true` or when `strategy: binary`; must be between `0` and `1`.
- `judge_key`: only valid when `source: llm_judge`; use the LLM judges skill for that path.

Strategy rules:

- Missing `strategy` defaults to `weighted`.
- `weighted`: optional scorecard-level `pass_threshold`; explicit gates are allowed.
- `binary`: every dimension is implicitly gated, every dimension needs `pass_threshold`, and scorecard-level `pass_threshold` must not be set.
- `hybrid`: requires at least one `gate: true`; gates must pass and the non-gate weighted average must clear any scorecard-level threshold.

## Result Interpretation
Validator results can be:

- `verdict: pass`: evidence was available and the validator condition passed.
- `verdict: fail`: evidence was available and the condition failed.
- `verdict: error`: evidence existed but parsing, config, regex, schema, JSONPath, or execution-result interpretation errored.
- unavailable state with no verdict: target or expected evidence could not be resolved.

Each available validator contributes `normalized_score` on a `0..1` scale. `source: validators` dimensions average the scoped validator scores; if scoped validators are unavailable, the dimension is unavailable.

## Common Validation And Scoring Failures
- `validators` is empty.
- Duplicate validator `key`.
- Unknown validator type such as `has_json`.
- Missing `target`.
- Missing `expected_from` for a validator that requires it.
- `target` or `expected_from` is not a supported evidence reference.
- A file validator targets `final_output` instead of `file:<post_execution_check_key>`.
- A `file:` target references a missing `post_execution_checks` key.
- `code_execution` targets a `directory_listing`.
- `file_json_schema` omits `config.schema`; this becomes a scoring error if it slips past pack validation.
- `directory_structure` omits `config`; this becomes a scoring error if it slips past pack validation.
- `code_execution` omits `config.test_command`; validation catches this when config is present, and scoring cannot produce a useful result without it.
- `metric` dimensions omit `normalization`.
- `binary` strategy sets scorecard-level `pass_threshold`.
- `hybrid` strategy has no gated dimension.
- A non-`llm_judge` dimension includes `judge_key`.

## Authoring Procedure
1. Identify the evidence source for each behavior: final output, case payload, case input, expectation, artifact metadata, captured file, or literal.
2. Pick the simplest supported validator type that proves the claim.
3. Add `expected_from` unless the validator type explicitly does not require it.
4. Keep file checks under `post_execution_checks` and target them with `file:<key>`.
5. Group validators into scorecard dimensions with `source: validators`.
6. Add numeric metrics only when the scorecard needs latency, cost, token, tool, completion, failure, or pass-rate signals.
7. Add gates and pass thresholds only for hard requirements.
8. Run `agentclash challenge-pack validate path/to/pack.yaml --json` and fix every returned field error.
9. Report which validators are scored, which are gates, and which evidence refs each one reads.

## Safety
- Do not put secrets in `literal:` expected values, captured files, artifact metadata, or validator config.
- Keep `file_capture` paths narrow; captured content becomes scoring evidence.
- Prefer deterministic fixture expectations over live mutable data.
- Use regex and JSONPath carefully so failures explain behavior instead of implementation trivia.
- Avoid scoring on private customer data unless retention and access are approved.

## Report Back Format
```text
Evaluation spec:
Validator summary:
- key:
  type:
  target:
  expected_from:
  config:
  score dimension:
  gate: <yes/no>
Metrics:
Scorecard:
- strategy:
- dimensions:
File captures:
Evidence references:
Validation command:
Validation result:
Expected result fields:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`
````

---

# Challenge Pack Tools Sandbox Skill

Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-tools-sandbox`

## Use This Skill When

Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-tools-sandbox
description: Use when defining AgentClash challenge pack tool access, sandbox runtime needs, filesystem expectations, network policy, command execution, and secret references.
metadata:
  agentclash.role: challenge-pack-tools
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Tools And Sandbox

## Purpose
Define the native execution surface a challenge pack needs: pack-defined custom tools, broad tool policy, sandbox network/package/env settings, and safe secret references.

Use this skill only when a pack truly needs native files, tools, network, packages, or sandbox behavior. Keep the runtime surface narrow enough that failures are attributable to the agent, not to an over-broad environment.

## Use When
- A challenge pack needs top-level `tools.custom`.
- The pack needs `version.tool_policy.allowed_tool_kinds`.
- The pack needs `version.sandbox` for network access, CIDR allowlists, environment variables, apt packages, or a sandbox template.
- A coding agent needs exact source-backed YAML shapes without reading the AgentClash source repo.
- A reviewer needs to check that no raw secrets or unsupported tool kinds are being introduced.

## Do Not Use When
- The pack is `prompt_eval` and only needs prompt/final-output evaluation.
- The task is workspace infrastructure setup with `agentclash infra tool ...`; use `agentclash-runtime-resources-setup`.
- The task is artifact declaration, scoring validators, LLM judges, validation/publish, or eval running; use the focused downstream skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## Validation Commands
Validate after adding or changing tools, tool policy, or sandbox settings.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Execution Mode Rules
`prompt_eval` packs cannot use challenge-pack tools or sandbox settings.

Do not include these in `prompt_eval`:
- top-level `tools`
- `version.tool_policy`
- `version.sandbox`

Use `native` when the task needs files, tool calls, network policy, extra packages, sandbox templates, file validators, directory checks, or code execution.

```yaml
version:
  number: 1
  execution_mode: native
```

## Tool Policy
`version.tool_policy.allowed_tool_kinds` accepts only these broad kinds:

```yaml
version:
  tool_policy:
    allowed_tool_kinds:
      - browser
      - build
      - data
      - file
      - network
```

Supported values are exactly `browser`, `build`, `data`, `file`, and `network`. Do not use `shell`; the current validator rejects it.

Use the narrowest set possible:
- `file` for reading/writing workspace files.
- `build` for build/test style operations.
- `network` for outbound HTTP or API access.
- `browser` for browser interaction.
- `data` for structured data access tools.

## Pack-Defined Custom Tools
Challenge-pack custom tools live at top-level `tools.custom`, not under `version`.

```yaml
tools:
  custom:
    - name: check_inventory
      description: Check inventory for a SKU.
      parameters:
        type: object
        properties:
          sku:
            type: string
        required:
          - sku
      implementation:
        primitive: http_request
        args:
          method: GET
          url: "https://api.example.com/inventory/${sku}"
          headers:
            Authorization: "Bearer ${secrets.INVENTORY_API_KEY}"
```

Source-backed fields:
- `tools.custom[]` entries are the supported pack-defined tool shape.
- `name` should be stable and unique in the pack.
- `parameters` must be valid JSON Schema when provided. If omitted, validation defaults to an empty object schema, but authoring explicit parameters is clearer.
- `implementation` is required.
- Non-`mock` implementations require `implementation.primitive`.
- Non-`mock` implementations require `implementation.args`, and `args` must be a JSON/YAML object.
- `implementation.primitive` cannot equal the tool's own `name`.
- Tool delegation cycles are rejected, and delegation depth greater than 8 is rejected.

Mock tools are the only exception to primitive/args validation:

```yaml
tools:
  custom:
    - name: fake_lookup
      parameters:
        type: object
      implementation:
        type: mock
```

## Template Placeholders And Secrets
Template placeholders are validated inside `implementation.args`.

Allowed placeholder forms:
- `${sku}` or `${sku.id}` when `sku` is declared in `parameters.properties`.
- `${parameters}` for the full parameters object.
- `${secrets.INVENTORY_API_KEY}` for a runtime secret reference.

Rejected placeholder forms:
- `${missing}` when `missing` is not declared in `parameters.properties`.
- `${}` empty placeholders.
- unclosed placeholders such as `${sku`.

Never paste raw secret values into YAML, chat, commits, or examples. Use secret names only. If a secret value is not already configured, ask the user to set it through the workspace secret flow without revealing the value in chat.

## Sandbox Settings
`version.sandbox` is valid only for `native` packs.

```yaml
version:
  execution_mode: native
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24
    env_vars:
      DATASET_MODE: fixture
      API_BASE_URL: https://api.example.com
    additional_packages:
      - jq
      - python3-venv
    sandbox_template_id: codex
```

Source-backed sandbox fields:
- `network_access`: boolean.
- `network_allowlist`: list of CIDR ranges. Hostnames such as `api.example.com` are not valid allowlist entries.
- `env_vars`: string map. Keys must match `[A-Za-z_][A-Za-z0-9_]*`.
- `additional_packages`: apt-style package names.
- `sandbox_template_id`: optional template identifier string.

Keep `network_access: false` or omit sandbox network settings unless the case truly needs outbound network. If network is needed, use the smallest CIDR allowlist available.

## Filesystem Expectations
`version.filesystem` exists as a raw map on the bundle model, but the current challenge-pack validator does not define a source-backed schema for it. Do not invent `version.filesystem` subfields in a skill-authored pack. Prefer explicit assets, case inputs, sandbox package/env settings, and scoring file evidence until the user or product docs provide an exact filesystem contract.

Use file-related behavior through:
- `version.assets` and case `inputs[].artifact_key`.
- `version.tool_policy.allowed_tool_kinds: [file]`.
- scoring file validators that target `file:<post_execution_check_key>`.
- `version.evaluation_spec.post_execution_checks` for file or directory capture.

## Compatibility Checklist
Before validating:

- Execution mode is `native` if `tools`, `tool_policy`, or `sandbox` are present.
- `allowed_tool_kinds` contains only `browser`, `build`, `data`, `file`, and `network`.
- No `shell` tool kind is present.
- Every custom tool has a stable `name`, parameter schema, `implementation.primitive`, and object `implementation.args`, unless it is a deliberate `type: mock` tool.
- Every `${...}` placeholder in tool args is declared as a parameter, is `${parameters}`, or starts with `${secrets.}`.
- No raw secret values are present.
- `network_allowlist` uses CIDR ranges.
- `env_vars` keys are valid environment variable names.
- `additional_packages` names are valid apt package names.
- Native settings are backed by a smoke case that proves the environment actually works.

## Common Validation Failures
- A `prompt_eval` pack includes `tools`, `version.tool_policy`, or `version.sandbox`.
- `version.tool_policy.allowed_tool_kinds` includes `shell`, `code`, or provider-specific tool names.
- `allowed_tool_kinds` is not an array of strings.
- A non-mock custom tool omits `implementation.primitive` or `implementation.args`.
- `implementation.args` is a string/list instead of an object.
- Tool args use unknown placeholders such as `${order_id}` without declaring `order_id` in `parameters.properties`.
- A tool delegates to itself or creates a delegation cycle.
- `network_allowlist` contains a hostname instead of CIDR.
- `env_vars` contains a key like `api-key` that is not a valid environment variable name.
- `additional_packages` includes an invalid apt package name.

## Authoring Procedure
1. Confirm whether `prompt_eval` is enough. If yes, omit tools and sandbox.
2. If native behavior is required, set `version.execution_mode: native`.
3. Add only the needed `allowed_tool_kinds`.
4. Define `tools.custom` only for pack-defined tools; use workspace infra skills for reusable workspace tools.
5. Write explicit JSON Schema parameters for each custom tool.
6. Use `${parameter}` and `${secrets.KEY}` placeholders in `implementation.args`; never raw secrets.
7. Add `version.sandbox` only for real network/env/package/template requirements.
8. Add a smoke case that proves the tool or sandbox dependency is reachable.
9. Run `agentclash challenge-pack validate ... --json` and fix every returned field error.
10. Hand off to artifacts, scoring, or validation/publish skills.

## Report Back Format
```text
Execution mode:
Tool policy:
Custom tools:
- name:
  primitive:
  parameters:
  secret references:
Sandbox:
Network:
Packages:
Filesystem/artifact dependencies:
Smoke case:
Validation command:
Validation result:
Ready for scoring/publish: <yes/no>
Open issues:
```

## Related Skills
- `agentclash-runtime-resources-setup`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-validation-publish`
````

---

# Challenge Pack Validation Publish Skill

Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-validation-publish`

## Use This Skill When

Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-validation-publish
description: Use when validating AgentClash challenge pack YAML, fixing schema/scoring/tool/asset errors, publishing runnable pack versions, recording returned IDs, and preparing next eval commands.
metadata:
  agentclash.role: challenge-pack-publication
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Validation and Publish

## Purpose
Validate a complete AgentClash challenge pack against the real workspace-backed parser, fix any source-reported errors, publish a runnable version only when intended, and report the IDs needed to start an eval.

## Use When
- A challenge pack YAML file is ready for validation after planning, YAML authoring, inputs, tools, artifacts, scoring, and judges are in place.
- The user asks whether a pack is publishable.
- The user asks to publish a pack and needs the returned pack/version/evaluation/input-set IDs.
- A reviewer needs exact next commands for `agentclash eval start` or `agentclash run create`.

## Do Not Use When
- The pack idea is still being designed; use `agentclash-challenge-pack-planner`.
- The YAML structure is being written from scratch; use `agentclash-challenge-pack-yaml-author`.
- The task is only to run an already published eval; use `agentclash-eval-runner`.
- The task is to upload standalone workspace artifacts; use `agentclash-challenge-pack-artifacts`.

## Inputs Needed
- Pack YAML path.
- Target workspace ID or a configured workspace from `agentclash link`, `agentclash workspace use`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml`.
- Whether publish is explicitly allowed now.
- Deployment ID/name for the next run command, if the user wants a ready-to-run command.

## Environment
Use hosted production by default unless the user intentionally targets local or self-hosted infrastructure:

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` is not an offline-only local parser in the current CLI. It reads the YAML file, requires auth and a workspace, then posts the raw bundle to:

```text
POST /v1/workspaces/<workspace-id>/challenge-packs/validate
```

`agentclash challenge-pack publish` posts the same raw bundle to:

```text
POST /v1/workspaces/<workspace-id>/challenge-packs
```

Use `--json` for coding-agent workflows because human output intentionally omits several IDs.

## CLI Surface
Use the full command names in docs and automation. `challenge-pack` also has the `cp` alias, but the full form is clearer for agents.

```bash
agentclash challenge-pack validate path/to/pack.yaml --json
agentclash challenge-pack publish path/to/pack.yaml --json
agentclash challenge-pack list --json
```

Exact command notes from the CLI:

- `validate` and `publish` each take exactly one file path.
- `validate` and `publish` have no command-specific flags today; `--json`, `--workspace`, `--api-url`, and `--quiet` are global flags.
- `publish` does not implement a challenge-pack-specific dry run, confirmation flag, or update flag.
- `list --json` returns visible packs for the current workspace, with each pack's runnable `versions`.
- The API request body is capped at 1 MiB.

## Validation Procedure
1. Set hosted API URL unless the workflow is explicitly local or self-hosted.
2. Verify the workspace context before making workspace-backed calls.
3. Run validation with `--json`.
4. If the command fails, inspect both stdout and stderr. Current validation failures use HTTP 400, so the CLI can report the response body through its generic API-error path instead of returning a clean JSON object on stdout.
5. Fix every `field` and `message` pair reported by the API.
6. Re-run validation until it succeeds.
7. Only then publish, and only when the user has already approved publishing.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash auth status
agentclash workspace use <WORKSPACE_ID>
agentclash challenge-pack validate path/to/pack.yaml --json
```

On success, `validate --json` returns:

```json
{
  "valid": true,
  "errors": null
}
```

On validation failure, the API response shape is:

```json
{
  "valid": false,
  "errors": [
    {
      "field": "version.evaluation_spec.validators[0].type",
      "message": "must be one of ..."
    }
  ]
}
```

Because invalid validation is returned with HTTP 400 today, do not assume a failing CLI invocation will print that object as successful structured stdout. Still use its `field` and `message` details when they are present.

## What Validation Checks
Validation parses the YAML bundle, normalizes legacy case data, runs bundle validation, validates scoring spec fields, and checks stored artifact IDs against the selected workspace.

Pack and version checks:

- `pack.slug`, `pack.name`, and `pack.family` are required.
- `version.number` must be greater than 0.
- `version.execution_mode` can be omitted, `native`, or `prompt_eval`.
- `prompt_eval` packs must not include top-level `tools`, `version.sandbox`, or `version.tool_policy`.
- At least one challenge is required.

Challenge checks:

- Every challenge needs `key`, `title`, `category`, and `difficulty`.
- `difficulty` must be exactly `easy`, `medium`, `hard`, or `expert`.
- Challenge keys must be unique.
- `artifact_refs[].key` must reference a declared `version.assets[].key`.

Input set checks:

- Every input set needs `key`, `name`, and at least one `cases` entry.
- Input set keys must be unique.
- All cases in the same input set must reference the same `challenge_key`.
- `challenge_key` must reference a declared challenge.
- `case_key` is the current field; legacy `item_key` is normalized but should not be newly authored.
- Case keys must be unique per challenge inside the input set.
- Case `inputs[].key`, `expectations[].key`, and local asset keys must be unique where they appear.
- `inputs[].artifact_key`, `expectations[].artifact_key`, and `source: artifact:<key>` must reference `version.assets`.
- `expectations[].source` may use `input:<input_key>` or `artifact:<asset_key>`; other prefixes are rejected.

Tools, sandbox, and artifact checks:

- `version.tool_policy.allowed_tool_kinds` must be an array containing only `browser`, `build`, `data`, `file`, and `network`.
- Custom composed tools must have a valid `implementation.primitive`, `implementation.args`, and parameter schema.
- Tool delegation cycles and delegation chains deeper than 8 are rejected.
- `version.sandbox.network_allowlist[]` entries must be valid CIDR strings.
- `version.sandbox.env_vars` keys must match `[A-Za-z_][A-Za-z0-9_]*`.
- `version.sandbox.additional_packages[]` values must look like apt package names.
- Each asset needs `key` and `path`; duplicate asset keys in the same list are rejected.
- If an asset includes `artifact_id`, validation checks that the artifact exists and belongs to the selected workspace.

Scoring checks:

- Errors from `version.evaluation_spec` are prefixed with `version.` in challenge-pack validation output.
- Strict evaluation-spec decoding catches unknown fields before scoring validation runs.
- Keep deterministic, LLM judge, and hybrid scoring mode rules aligned with `agentclash-challenge-pack-scoring-validators` and `agentclash-challenge-pack-llm-judges`.

## Fix Patterns
- `pack.family is required`: set a stable family string, usually the pack slug or product/workload family.
- `version.execution_mode must be one of "native", "prompt_eval"`: use exactly `native` or `prompt_eval`, or omit the field.
- `tools must be empty when version.execution_mode is prompt_eval`: switch to `native` or remove top-level `tools`.
- `version.tool_policy.allowed_tool_kinds[...] must be one of browser, build, data, file, network`: remove unsupported kinds such as `shell`.
- `input_sets[...].cases[...].challenge_key must reference the same challenge as the other cases in this input set`: split cases by challenge into separate input sets.
- `case_key is required`: add `case_key`; do not author new `item_key` entries.
- `expectations[...].source must start with input: or artifact:`: use `input:<input_key>` or `artifact:<version_asset_key>`.
- `artifact_id must reference an existing artifact`: upload or find the artifact first with the artifact CLI, then use its UUID.
- `artifact_id must belong to the workspace`: switch workspace or use an artifact from the selected workspace.
- `decode evaluation spec from yaml`: look for a misspelled or unsupported scoring field; this may surface as a CLI/API error rather than a normal `errors[]` item.

## Publish Procedure
Publish is a workspace mutation that creates a runnable challenge-pack version. Do it only after validation passes and the user has approved publishing.

```bash
agentclash challenge-pack validate path/to/pack.yaml --json
agentclash challenge-pack publish path/to/pack.yaml --json
```

On success, `publish --json` returns:

```json
{
  "challenge_pack_id": "<CHALLENGE_PACK_ID>",
  "challenge_pack_version_id": "<CHALLENGE_PACK_VERSION_ID>",
  "evaluation_spec_id": "<EVALUATION_SPEC_ID>",
  "input_set_ids": ["<CHALLENGE_INPUT_SET_ID>"],
  "bundle_artifact_id": "<BUNDLE_ARTIFACT_ID>"
}
```

`bundle_artifact_id` is optional. It is present when the backend stores the authored YAML bundle as a workspace artifact.

Without `--json`, the current CLI only prints the pack ID and version ID. Always use `--json` when you need `evaluation_spec_id`, `input_set_ids`, or `bundle_artifact_id`.

Publish stores:

- a workspace-scoped challenge pack, keyed by `pack.slug`;
- a runnable challenge pack version using `version.number`;
- one evaluation spec from `version.evaluation_spec`;
- one challenge input set row for each authored `input_sets[]` entry;
- one challenge input item row for each case;
- optionally, a `challenge_pack_bundle` artifact for the source YAML.

## Publish Failure Modes
- `challenge_pack_version_exists`: the same pack already has that `version.number`; increment `version.number` or intentionally target a different pack slug.
- `challenge_pack_metadata_conflict`: an existing pack with the same workspace and slug has a different `pack.name` or `pack.family`; keep metadata stable or choose a new slug.
- Billing or entitlement errors: the workspace cannot publish private challenge packs under its current plan.
- Authorization errors: the current caller cannot publish to the selected workspace.
- Validation errors: publish re-runs the same bundle and stored-artifact checks as validate.
- Oversized bundle errors: keep the YAML body at or below the current 1 MiB request limit.

## Next Eval Commands
Record the IDs from `publish --json` before starting a run.

Workflow-first command, with selectors accepted by the CLI:

```bash
agentclash eval start \
  --pack <PACK_ID_OR_SLUG_OR_EXACT_NAME> \
  --pack-version <VERSION_ID_OR_VERSION_NUMBER> \
  --input-set <INPUT_SET_ID_OR_KEY_OR_EXACT_NAME> \
  --deployment <DEPLOYMENT_ID_OR_EXACT_NAME> \
  --follow
```

Lower-level non-interactive command, using IDs:

```bash
agentclash run create \
  --challenge-pack-version <CHALLENGE_PACK_VERSION_ID> \
  --input-set <CHALLENGE_INPUT_SET_ID> \
  --deployments <AGENT_DEPLOYMENT_ID> \
  --follow
```

When the published version has multiple input sets, pass `--input-set` in non-interactive workflows. `agentclash eval start` can resolve an input set by ID, key, or exact name; `agentclash run create` expects the input set ID.

## Safety Notes
- Do not publish unless the user has already approved the workspace mutation.
- Do not include raw secret values in pack YAML, assets, examples, comments, or reports.
- `publish` does not upload local files referenced by `path`; use the artifact CLI first when a stored `artifact_id` is required.
- Prefer small smoke input sets before publishing large benchmark suites.
- Keep `pack.slug`, `pack.family`, challenge keys, input set keys, and case keys stable after publish so scorecards, regressions, and CI gates remain comparable.
- Avoid publishing customer-sensitive fixtures unless retention and workspace access are approved.

## Report Back Format
```text
Validation: <pass/fail>
Workspace: <workspace-id or configured default>
Pack file: <path>
Errors fixed:
- <field>: <message/fix>
Published: <yes/no>
Challenge pack ID: <id>
Challenge pack version ID: <id>
Evaluation spec ID: <id>
Input set IDs:
- <id> (<key/name if known>)
Bundle artifact ID: <id or not returned>
Next eval command:
Next run command:
Notes: <conflicts, entitlement/auth caveats, skipped publish reason>
```

## Related Skills
- `agentclash-cli-setup`
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-yaml-author`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-eval-runner`

## Related Docs
- `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author`
- `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-artifacts`
- `/docs-md/agent-skills/agentclash-eval-runner`
````

---

# Challenge Pack YAML Author Skill

Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.

Source: https://www.agentclash.dev/docs/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author
Markdown export: https://www.agentclash.dev/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author

Canonical source: `web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author/SKILL.md`

Markdown export: `/docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-yaml-author`

## Use This Skill When

Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.

## Full SKILL.md

````markdown
---
name: agentclash-challenge-pack-yaml-author
description: Use when writing or editing AgentClash challenge pack YAML, including pack/version metadata, execution mode, challenges, cases, input sets, scoring blocks, tools, sandbox settings, assets, and validation handoff.
metadata:
  agentclash.role: challenge-pack-authoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack YAML Author

## Purpose
Write challenge-pack YAML that matches the current AgentClash parser and validators, without requiring access to the AgentClash source repo.

Use this skill after planning. The output should be a concrete YAML file plus the exact validation commands a coding agent should run before publish.

## Use When
- A challenge-pack plan needs to become valid YAML.
- An existing pack YAML needs source-compatible edits.
- A coding agent needs the exact YAML object shape for `pack`, `version`, `challenges`, `input_sets`, scoring, tools, sandbox, assets, cases, inputs, and expectations.
- The agent will later hand the file to `agentclash-challenge-pack-validation-publish`.

## Do Not Use When
- The user only has a vague eval idea; use `agentclash-challenge-pack-planner` first.
- The YAML is finished and only needs validation or publish; use `agentclash-challenge-pack-validation-publish`.
- The task is to start runs or choose deployments; use `agentclash-eval-runner` or deployment skills.

## Environment
Use hosted production unless the user intentionally points at another backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

## CLI Commands
Start from the CLI template when possible, then edit the YAML.

```bash
agentclash challenge-pack init support-eval.yaml --template prompt_eval --name "Support Eval" --slug support-eval
agentclash challenge-pack init native-files.yaml --template native --name "Native Files" --slug native-files
agentclash challenge-pack validate support-eval.yaml
agentclash challenge-pack validate support-eval.yaml --json
```

Human validation output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` when a coding agent needs structured fields such as `valid` and `errors`.

## YAML Skeleton
These are the top-level fields accepted by the bundle parser:

```yaml
pack:
  slug: support-eval
  name: Support Eval
  family: support
  description: Evaluates concise customer support answers.

version:
  number: 1
  execution_mode: prompt_eval
  evaluation_spec:
    name: Support Eval Scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_policy
        type: contains
        target: final_output
        expected_from: literal:refund policy
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          weight: 1

challenges:
  - key: refund-question
    title: Refund Policy Question
    category: support
    difficulty: easy
    instructions: Answer the customer in a concise, helpful tone.

input_sets:
  - key: smoke
    name: Smoke
    description: Fast validation cases.
    cases:
      - challenge_key: refund-question
        case_key: basic-refund
        payload:
          customer_message: Can I get a refund after 14 days?
        inputs:
          - key: prompt
            kind: text
            value: Can I get a refund after 14 days?
        expectations:
          - key: expected_policy
            kind: text
            source: input:prompt
```

## Required Fields
- `pack.slug`, `pack.name`, and `pack.family` are required.
- `version.number` must be greater than zero.
- `version.execution_mode` should be explicit: use `prompt_eval` or `native`.
- `version.evaluation_spec.validators` must contain at least one validator.
- Every challenge needs `key`, `title`, `category`, and `difficulty`.
- `difficulty` must be `easy`, `medium`, `hard`, or `expert`.
- Every input set needs `key`, `name`, and at least one `cases` entry.
- Every case needs `challenge_key` referencing a declared challenge and a stable `case_key`.
- Use `cases`, not legacy `items`, in new YAML.

## Execution Modes
Use `prompt_eval` for prompt-style tasks where the agent only needs the prompt and final output.

`prompt_eval` cannot use:
- top-level `tools`
- `version.tool_policy`
- `version.sandbox`

Use `native` when the challenge needs files, tools, network policy, package installation, file validators, directory checks, code execution, or sandbox behavior.

## Native Tools And Sandbox
Only include these blocks for `native` packs.

```yaml
tools:
  custom:
    - name: lookup_order
      description: Looks up an order by ID.
      parameters:
        type: object
        properties:
          order_id:
            type: string
        required:
          - order_id
      implementation:
        primitive: http_request
        args:
          method: GET
          url: "https://example.test/orders/${order_id}"
          headers:
            Authorization: "Bearer ${secrets.ORDER_API_KEY}"

version:
  number: 1
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - browser
      - file
      - network
  sandbox:
    network_access: true
    network_allowlist:
      - 203.0.113.0/24
    env_vars:
      DATASET_MODE: fixture
    additional_packages:
      - jq
```

Supported `version.tool_policy.allowed_tool_kinds` values are exactly `browser`, `build`, `data`, `file`, and `network`. Do not use `shell` as an allowed tool kind.

For `tools.custom`, each tool needs `name`, `parameters`, and `implementation`. Non-`mock` implementations need `implementation.primitive` and `implementation.args`. Template placeholders in `args` use `${parameter_name}` for declared parameters and may reference `${secrets.SECRET_KEY}` when the runtime provides that secret; never paste raw secret values into YAML.

Sandbox rules:
- `network_allowlist` entries must be valid CIDR ranges.
- `env_vars` keys must look like shell env names, for example `DATASET_MODE`.
- `additional_packages` entries must be valid apt-style package names.
- Never put raw secrets in YAML. Name required secret keys in notes and configure them through the workspace/runtime/provider flow.

## Assets, Inputs, Expectations, And Artifacts
Assets may appear on `version`, challenges, or cases. Each asset list must use unique `key` values, and every asset needs `path`.

```yaml
version:
  assets:
    - key: policy_pdf
      kind: file
      path: assets/policy.pdf
      media_type: application/pdf

challenges:
  - key: summarize-policy
    title: Summarize Policy
    category: documents
    difficulty: medium
    artifact_refs:
      - key: policy_pdf

input_sets:
  - key: full
    name: Full
    cases:
      - challenge_key: summarize-policy
        case_key: policy-summary
        inputs:
          - key: source_doc
            kind: file
            artifact_key: policy_pdf
            path: assets/policy.pdf
        expectations:
          - key: summary_requirements
            kind: text
            value: Mention refund window and exclusions.
```

Case input fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `path`.

Case expectation fields are `key`, `kind`, optional `value`, optional `artifact_key`, and optional `source`. Supported `source` values are empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`, for example `source: input:prompt`.

## Evaluation Spec
`evaluation_spec` controls scoring. Keep deterministic checks deterministic, and use LLM judges only when subjective quality is genuinely needed.

```yaml
evaluation_spec:
  name: Support Eval Scoring
  version_number: 1
  judge_mode: hybrid
  validators:
    - key: has_json
      type: json_schema
      target: final_output
      expected_from: 'literal:{"type":"object","required":["answer"]}'
  llm_judges:
    - key: helpfulness
      mode: rubric
      model: gpt-4.1
      rubric: Judge whether the answer is helpful, grounded, and concise.
      context_from:
        - challenge_input
        - final_output
  scorecard:
    strategy: hybrid
    dimensions:
      - key: schema_gate
        source: validators
        validators:
          - has_json
        weight: 1
        gate: true
        pass_threshold: 1
      - key: helpfulness
        source: llm_judge
        judge_key: helpfulness
        weight: 1
```

Supported `judge_mode` values are `deterministic`, `llm_judge`, and `hybrid`.

Supported validator types include `exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, and `code_execution`.

Evidence references accepted by validators and judges include `final_output`, `run.final_output`, `challenge_input`, `case.payload`, `case.payload.<path>`, `case.inputs.<path>`, `case.expectations.<path>`, `artifact.<path>`, `file:<post_execution_check_key>`, and `literal:<value>`.

File validators require a `file:` target. `code_execution` validators must target a `post_execution_checks` entry of type `file_capture`.

For `judge_mode: deterministic`, omit `llm_judges`. For `judge_mode: llm_judge`, include at least one judge. For `judge_mode: hybrid`, include validators and at least one judge; hybrid scorecards need a gated dimension.

For scorecard dimensions with `source: validators`, omit the `validators` list only when the dimension should score every validator. Add `validators: [<validator_key>]` when the dimension should cover a specific subset.

Do not put `${secrets.*}` references in LLM judge `rubric`, `assertion`, or `prompt` text. Secrets are allowed in native tool implementation args when the runtime provides them, but judge prompt text rejects secret references.

## Authoring Procedure
1. Start with `agentclash challenge-pack init ... --template prompt_eval` or `--template native`.
2. Fill `pack` metadata with stable slug/name/family.
3. Set `version.number: 1` for a new pack and choose `execution_mode`.
4. Write challenges before cases so every `case.challenge_key` can reference a real challenge.
5. Add input sets by run purpose: `smoke`, `ci`, `regression`, `full`, or similar.
6. Add deterministic validators first; add LLM judges only when deterministic evidence cannot capture quality.
7. Add native-only tools, sandbox, files, assets, and artifact refs only when the execution mode is `native`.
8. Run validation with and without `--json`.
9. Hand off to publication only after validation passes.

## Common Validation Failures
- Missing `pack.family`, `challenge.category`, or `case_key`.
- Case `challenge_key` does not match any challenge `key`.
- `difficulty` is not one of `easy`, `medium`, `hard`, or `expert`.
- A `prompt_eval` pack includes `tools`, `tool_policy`, or `sandbox`.
- `allowed_tool_kinds` contains unsupported values such as `shell`.
- Asset or artifact reference keys are missing, duplicated, or point at undeclared version assets.
- Case expectation `source` is not empty, `input:<case-input-key>`, or `artifact:<version-asset-key>`.
- File validators do not use a `file:` evidence target.
- `judge_mode` conflicts with the presence or absence of `llm_judges`.

## Report Back Format
```text
YAML file:
Execution mode:
Challenges:
Input sets:
Scoring mode:
Native tools/sandbox/assets:
Validation command:
Validation result:
Ready for publish: <yes/no>
Next skill: agentclash-challenge-pack-validation-publish
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-planner`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-validation-publish`
````