# AgentClash Docs Bundle Canonical docs home: https://www.agentclash.dev/docs Machine-readable index: https://www.agentclash.dev/llms.txt This file concatenates the currently shipped AgentClash docs pages and selected product page links into one markdown-oriented bundle for assistants, coding agents, and local retrieval pipelines. ## Public product pages - [AI Agent Evaluation Platform](https://www.agentclash.dev/platform/agent-evaluation) - Public page for real-task AI agent evaluation, replay evidence, scorecards, challenge packs, and CI regression gates. - [AI Agent Regression Testing](https://www.agentclash.dev/platform/agent-regression-testing) - Public page for baseline-versus-candidate agent regression testing, pull request gates, and release evidence. ## Blog posts - [AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks](https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing) - A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates. - [Why We Built AgentClash](https://www.agentclash.dev/blog/why-we-built-agentclash) - Static benchmarks leak. Leaderboards reward hype. We built something different. --- # AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates. Source: https://www.agentclash.dev/blog/ai-agent-evaluation-regression-testing Published: 2026-05-07 Author: Atharva Most AI agent evaluation starts in the wrong place. A team tries a few prompts, compares a model leaderboard, watches one impressive demo, and ships the agent that looked best in a narrow test. Then the agent reaches a real workflow: messy tools, missing context, timeouts, partial files, stale APIs, and users who expect the whole task to finish. That is where benchmark-only evaluation breaks down. Agents are not just text generators. They plan, call tools, modify state, inspect results, recover from mistakes, and decide when to stop. If the eval only checks the final answer, it misses the behavior that makes an agent safe or expensive to run in production. Real AI agent evaluation needs regression testing. ## What an agent eval should prove An agent eval should answer a practical release question: is this agent ready to do this job again, under the same constraints, without getting worse? That means the eval needs more than a score. It needs a repeatable workload, a fair comparison, and enough evidence for a reviewer to understand the result. A useful AI agent evaluation platform should capture: - the task definition and inputs - the tool and network policy - the agent's actions and observations - produced files, logs, and artifacts - correctness, cost, latency, and evidence quality - the comparison between a candidate and a baseline That is the difference between "the model looked good" and "this agent passed the release gate." AgentClash is built around that second workflow. The [AI agent evaluation platform](https://www.agentclash.dev/platform/agent-evaluation) page explains the product surface, but the core idea is simple: run agents on the same real task with the same tools, then preserve replay evidence and scorecards so the result is reviewable. ## Why static benchmarks are not enough Static benchmarks are useful for a first filter. They are not enough for shipping agents. They usually measure isolated answers, not trajectories. 
They rarely include your private tools, your repository shape, your data contracts, your latency budget, or your failure modes. They can also hide the most important production question: did the agent solve the task in a way your team can trust and repeat? For agents, the path matters. Two agents can produce the same final answer while behaving very differently. One might use the right tool, verify its work, attach the required artifact, and stay inside budget. Another might hallucinate a file, skip the failing test, and still land near the correct prose answer. A final-answer-only benchmark treats those runs as similar. A real agent eval should not. ## Turn failures into challenge packs The repeatable unit in AgentClash is a challenge pack: a workload definition with cases, inputs, tools, scoring rules, and artifacts. Challenge packs make agent evaluation operational because they turn a vague question into something runnable: - What task should the agent perform? - What inputs and fixtures should it see? - Which tools are allowed? - What evidence should be captured? - Which validators or judges decide success? When an agent fails in production or in a release test, the failure should become a reusable case. That is how the eval suite compounds. Instead of debugging the same mistake every few weeks, you promote it into coverage and make the next candidate prove it did not regress. The docs for [writing a challenge pack](https://www.agentclash.dev/docs-md/guides/write-a-challenge-pack) are the right starting point if you want to turn a real workflow into a durable eval. ## Add regression gates to CI The strongest agent eval is not a dashboard someone remembers to check. It is a gate in the release loop. AI agent regression testing compares a candidate run against a known baseline. If the candidate gets worse on correctness, cost, latency, artifacts, or another scorecard dimension, the gate can block the pull request before the change reaches users. That matters because agent quality can regress in subtle ways: - a prompt edit improves one demo but breaks another workflow - a model switch changes tool strategy or latency - a sandbox image update changes installed dependencies - a retrieval change gives the agent stale or incomplete context - a tool permission change makes a previously solved task impossible The [AI agent regression testing](https://www.agentclash.dev/platform/agent-regression-testing) page covers the product angle. The [CI/CD agent gates](https://www.agentclash.dev/docs-md/guides/ci-cd-agent-gates) guide covers the implementation path. ## What to look for in an agent evaluation tool If you are comparing agent evaluation tools, look past the leaderboard. The tool should help your team make a release decision, debug failures, and improve the next test suite. Useful capabilities include: - real-task execution instead of prompt-only grading - sandboxed runs with explicit tool and network policy - replay timelines for tool calls and observations - scorecards that separate correctness, cost, latency, and evidence - artifact capture for files, logs, and outputs - baseline versus candidate comparison - CI gates for regressions - a workflow for promoting failures into reusable tests The goal is not to collect more numbers. The goal is to shorten the path from "this agent failed" to "we understand why, we fixed it, and the failure is now covered." ## The release loop The loop should look like this: 1. Capture a real task as a challenge pack. 2. 
Run candidate and baseline agents under the same constraints. 3. Inspect replay evidence and scorecards. 4. Promote important failures into regression cases. 5. Gate future changes in CI. That is how AI agent evaluation becomes engineering infrastructure instead of a one-off experiment. Benchmarks can tell you where to look. Regression testing tells you whether the agent is safe to ship again. --- # Why We Built AgentClash Static benchmarks leak. Leaderboards reward hype. We built something different. Source: https://www.agentclash.dev/blog/why-we-built-agentclash Published: 2026-03-23 Author: Atharva Your benchmarks are lying to you. Every team picking an AI model today is doing the same thing: reading someone else's leaderboard, running a few prompts in a playground, and shipping based on vibes. The benchmarks are gamed. The leaderboards reward hype. And you're left guessing. We built AgentClash because we were tired of this. ## The problem Static test sets leak into training data. Crowd-voted rankings measure popularity, not capability. You test agents in isolation, one at a time, and compare scores that were generated under completely different conditions. None of this tells you which model is actually better **for your task**. ## What we're building AgentClash puts your models on the same real task, at the same time. Same tools, same constraints, same environment. Scored live on completion, speed, token efficiency, and tool strategy. Step-by-step replays show exactly why one agent won and another didn't. Every failure gets captured, classified, and turned into a regression test — automatically. The more you run, the smarter your eval suite gets. ## Why open source Because eval infrastructure shouldn't be a black box. You should be able to see exactly how models are scored, modify the scoring to fit your use case, and run it on your own infra. We're building this in the open. Every commit is public. Every design decision is documented. ## What's next We're in private beta. If you're shipping agents and you're tired of guessing which model to use, [join the waitlist](https://agentclash.dev). Follow the build on [GitHub](https://github.com/agentclash/agentclash). --- # AgentClash Documentation Run agents head-to-head on real tasks, inspect the telemetry, and understand the system without wading through roadmap fiction. Source: https://www.agentclash.dev/docs Markdown export: https://www.agentclash.dev/docs-md AgentClash runs agents against the same task, with the same tools and time budget, then shows you who finished, who stalled, and where the run broke. These docs are layered for three kinds of readers: - evaluators deciding whether the product is worth trying - users who need to configure a workspace and run real comparisons - contributors who want to understand the stack and change it safely The current public surface is still early. This docs pass only covers behavior that is already visible in the repo today: the CLI, the local stack, the current run model, and the main runtime components. Start with the hosted quickstart if you want the shortest path to a real command sequence. Start with self-host if you want the full local stack on your machine. Start with architecture if you are here to hack on the code. For **challenge pack YAML, scoring, tooling, sandboxes, judges, and CLI eval flows**, start at [Challenge pack reference](https://www.agentclash.dev/docs-md/challenge-packs).
--- # Hosted Quickstart Validate the CLI against the hosted production backend, set a workspace, and get to your first runnable command in a few minutes. Source: https://www.agentclash.dev/docs/getting-started/quickstart Markdown export: https://www.agentclash.dev/docs-md/getting-started/quickstart This path is for people changing the CLI or trying the product without booting the whole stack locally. > Note: The hosted quickstart assumes your workspace already has challenge packs and > deployments. If it does not, stop after `link` and then author a pack with > `challenge-pack init`; you have still verified auth, connectivity, and > workspace selection. ## 1. Install the CLI ```bash npm i -g agentclash ``` ## 2. Point the CLI at production and log in ```bash export AGENTCLASH_API_URL="https://api.agentclash.dev" agentclash auth login --device ``` Use `--device` when you are in a remote shell or do not want the CLI to open a browser automatically. ## 3. Link a workspace ```bash agentclash link ``` The CLI resolves the API base URL in this order: ```text --api-url > AGENTCLASH_API_URL > saved user config > http://localhost:8080 ``` `agentclash link` saves the selected workspace in user config so later commands do not need raw IDs by default. ## 4. Choose your next path ```bash agentclash doctor agentclash eval start --help ``` If the workspace is already seeded with challenge packs and agent deployments, create and follow a run: ```bash agentclash eval start --follow ``` If the workspace is empty, scaffold a starter pack first: ```bash agentclash challenge-pack init support-eval.yaml agentclash challenge-pack validate support-eval.yaml agentclash challenge-pack publish support-eval.yaml agentclash eval start --follow agentclash baseline set agentclash eval scorecard ``` ## Verification You should now have: - a valid CLI login - a default workspace linked locally - a working connection to the hosted API - either a created run or enough context to see what the workspace is missing - a clear next step: publish a challenge pack, start an eval, or save a baseline ## See also - [Self-Host](https://www.agentclash.dev/docs-md/getting-started/self-host) - [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals) - [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli) --- # Self-Host Starter Bring up the local AgentClash stack with the repo’s existing scripts and understand which dependencies are mandatory versus optional. Source: https://www.agentclash.dev/docs/getting-started/self-host Markdown export: https://www.agentclash.dev/docs-md/getting-started/self-host This is the shortest honest path to a local AgentClash environment today. It is based on the repo’s existing development scripts, not an imagined one-click installer. > Warning: The repo does not currently ship a Helm chart or a polished production > installer. What it does ship is a local stack script plus documented Railway > deployment building blocks for the backend. ## Prerequisites - Go `1.25+` - Docker - Temporal CLI - Node.js `20+` - `pnpm` - `psql` ## 1. Start the local stack From the repo root: ```bash ./scripts/dev/start-local-stack.sh ``` This script starts PostgreSQL and Redis, applies migrations, launches the Temporal dev server if needed, then starts the API server and worker. Logs are written under `/tmp/agentclash-local-stack/`. ## 2. Start the web app ```bash cd web pnpm install pnpm dev ``` The web app runs at `http://localhost:3000`. ## 3. 
Seed a runnable fixture Back in the repo root: ```bash ./scripts/dev/seed-local-run-fixture.sh ./scripts/dev/curl-create-run.sh ``` Without a real sandbox provider such as E2B, native runs can still be created, but the model-backed execution path will not complete successfully. ## Required vs optional services - Required: PostgreSQL, Temporal, API server, worker - Optional: Redis for event fanout and rate limiting - Optional: E2B for sandboxed native execution - Optional: S3-compatible storage for production artifact storage ## Production notes The repo’s documented production building blocks today are: - Railway for the API server and worker - Temporal Cloud for orchestration - Vercel for `web/` - S3-compatible storage for artifacts ## Verification You should be able to hit: ```bash curl http://localhost:8080/healthz ``` Then open `http://localhost:3000`. ## See also - [Hosted Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview) - [Contributor Setup](https://www.agentclash.dev/docs-md/contributing/setup) --- # First Eval Walkthrough Use the current seeded local path to create a run, stream events, and inspect ranking output without inventing setup that is not in the repo. Source: https://www.agentclash.dev/docs/getting-started/first-eval Markdown export: https://www.agentclash.dev/docs-md/getting-started/first-eval This walkthrough sticks to what the repo already supports today: seed local data, create a run, stream events, and inspect the result. ## 1. Bring up the local stack From the repo root: ```bash ./scripts/dev/start-local-stack.sh ``` If you want the browser UI too: ```bash cd web pnpm install pnpm dev ``` ## 2. Seed a runnable fixture Back in the repo root: ```bash ./scripts/dev/seed-local-run-fixture.sh ``` That script seeds enough data to create a local run through the API. ## 3. Create the run You can hit the API directly: ```bash ./scripts/dev/curl-create-run.sh ``` Or, if you are using the CLI against a prepared workspace, create and follow the run there: ```bash agentclash eval start --follow ``` `eval start` is the workflow-first wrapper around `run create` — it resolves challenge packs, versions, input sets, and deployments by name or interactive selection. Use `agentclash run create` directly when you want to pass IDs explicitly (CI scripts, automation). ## 4. Inspect the result Once you have a run ID, inspect its status and ranking: ```bash agentclash run get agentclash run ranking ``` If the web app is running, open the workspace run detail view in the browser and inspect the replay and scorecard surfaces from there. ## What you should see - a run record created in the workspace - event streaming during execution when you follow the run - a ranking view once the backend has enough completed run-agent results to score > Warning: Without a real sandbox provider such as E2B, the native model-backed path can > still stall or fail after run creation. That is expected in the unconfigured > local setup. ## See also - [Self-Host Starter](https://www.agentclash.dev/docs-md/getting-started/self-host) - [Runs and Evals](https://www.agentclash.dev/docs-md/concepts/runs-and-evals) - [Architecture Overview](https://www.agentclash.dev/docs-md/architecture/overview) --- # Runs and Evals The product language around runs and evals is easy to blur. The current codebase makes one distinction especially important. 
Source: https://www.agentclash.dev/docs/concepts/runs-and-evals Markdown export: https://www.agentclash.dev/docs-md/concepts/runs-and-evals A **run** is the concrete execution object you create, stream, rank, compare, and inspect in AgentClash today. In the current user-facing product surface, `run` is the first-class noun: - `agentclash run create` - `agentclash run list` - `agentclash run ranking` - `agentclash compare gate --baseline --candidate ` The workflow-first surface (`agentclash eval start`, `agentclash baseline set`, `agentclash eval scorecard`) wraps these resource commands with name-based selectors and a bookmarked baseline so day-to-day evaluation does not require juggling raw run IDs. The resource commands above remain the canonical ID-centric path for CI and automation. A run is not just one model token stream. It is the container for a scored evaluation attempt inside a workspace, including the challenge pack version, selected agent deployments, lifecycle timestamps, and ranking output. The word **eval** is broader. People use it to mean “the experiment I am trying to run” or “the graded set of results I care about.” That is reasonable, but if you are reading the code or the CLI, you should anchor on this: - **Run** = the concrete resource you create and query. - **Eval** = the broader exercise or outcome you are trying to measure. There are also places in the codebase that refer to eval sessions, but the main shipped workflow today still revolves around runs and ranked run results. If you keep that in your head, the CLI and API are much easier to follow. ## Practical rule of thumb Use **run** when you are talking about a real resource ID. Use **eval** when you are talking about the experiment design or the larger testing loop. ## See also - [Hosted Quickstart](https://www.agentclash.dev/docs-md/getting-started/quickstart) - [CLI Reference](https://www.agentclash.dev/docs-md/reference/cli) --- # Agents and Deployments Understand how AgentClash turns a build plus runtime/provider resources into a concrete deployment that can be scheduled into a run. Source: https://www.agentclash.dev/docs/concepts/agents-and-deployments Markdown export: https://www.agentclash.dev/docs-md/concepts/agents-and-deployments A deployment is the workspace-scoped runnable target that AgentClash can attach to a run. ## Why a deployment exists at all AgentClash is stricter than a typical playground because it has to compare like with like. A model name by itself is not enough. The scheduler needs a concrete object that says: - which build is being run - which build version is current - which runtime policy applies - which provider credentials or model mapping are attached That concrete object is the deployment. ## The current creation contract The current API schema for `CreateAgentDeploymentRequest` requires: - `name` - `agent_build_id` - `build_version_id` - `runtime_profile_id` It also supports these optional fields: - `provider_account_id` - `model_alias_id` - `deployment_config` The OpenAPI description also says only ready build versions can be deployed. ## Runtime profiles are the execution envelope A runtime profile defines how aggressive or constrained execution should be. In the current API and web types, a runtime profile carries fields like: - `execution_target` - `trace_mode` - `max_iterations` - `max_tool_calls` - `step_timeout_seconds` - `run_timeout_seconds` - `profile_config` That last field matters. 
The native executor reads runtime-profile sandbox overrides from `profile_config`, including things like filesystem roots and `allow_shell` or `allow_network` toggles. The clean mental model is: - the challenge pack defines what the workload wants - the runtime profile defines execution ceilings and local overrides - the deployment binds those choices to a runnable target ## Provider accounts are how credentials enter the system A provider account is a workspace resource with: - `provider_key` - `name` - `credential_reference` - optional `limits_config` The important detail is how credentials are stored. If you create a provider account with a raw `api_key`, the infrastructure manager stores that value as a workspace secret and rewrites the credential reference automatically to: ```text workspace-secret://PROVIDER__API_KEY ``` So the product already prefers indirection over plaintext credentials on the resource itself. ## Model aliases are not just display sugar The user question usually comes out as “provider alias” or “model alias.” In the current product surface, the real resource is `model alias`. A model alias maps a workspace-friendly key to a model catalog entry, and can optionally be tied to a provider account. The current fields are: - `alias_key` - `display_name` - `model_catalog_entry_id` - optional `provider_account_id` That gives you a stable name inside the workspace even if the underlying provider model identifier is ugly or if you need multiple account-specific mappings. ## A deployment is where these pieces come together A good way to think about the chain is: - agent build version: what logic is being deployed - runtime profile: how it is allowed to execute - provider account: which credentials or spend limits back external model calls - model alias: which model selection the deployment should use consistently - deployment: the runnable handle used by runs This is why the docs should not collapse deployment into “selected model.” The object is richer than that. ## What the UI and CLI expose today The current repo already exposes the resource model across multiple surfaces: - CLI `deployment create` and `deployment list` - workspace pages for runtime profiles, provider accounts, model aliases, deployments, secrets, and tools - run creation UI that asks for challenge pack and deployment selection separately That separation is deliberate. A run is an execution event. A deployment is reusable infrastructure state. ## What is stable versus still moving The stable part is the dependency chain and the API surface. The still-moving part is how richly each resource is edited in the UI and how much automation exists around them. So the right docs posture is: - document the current fields and flows precisely - avoid pretending the deployment UX is fully polished - treat the resource model itself as real and important ## See also - [Configure Runtime Resources](../guides/configure-runtime-resources) - [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs) - [Tools, Network, and Secrets](../concepts/tools-network-and-secrets) - [CLI Reference](../reference/cli) --- # Challenge Packs and Inputs Learn what a challenge pack really is in AgentClash, how the bundle is structured, and how inputs become runnable cases. 
Source: https://www.agentclash.dev/docs/concepts/challenge-packs-and-inputs Markdown export: https://www.agentclash.dev/docs-md/concepts/challenge-packs-and-inputs A challenge pack is a versioned YAML bundle that defines the workload, scoring contract, execution policy, and input sets for a repeatable evaluation. For field-by-field YAML, scoring enums, judge modes, primitives, sandbox flags, and CLI eval flows, use the **[challenge pack reference hub](../challenge-packs)**—it is written for pack authors who need repo-accurate depth, not a marketing overview. ## What makes it a challenge pack instead of a prompt A challenge pack is not just a task description. In the current repo, a runnable pack carries enough structure for AgentClash to do four jobs consistently: - execute the same workload again later - attach one or more deployments to that workload - score the result using a versioned evaluation spec - preserve the relationship between a failed case and the evidence that exposed it That is why the API does not ask you to start a run with a loose prompt blob. It asks for a `challenge_pack_version_id`. ## The current bundle shape The parser in `backend/internal/challengepack/bundle.go` expects a YAML bundle with these top-level sections: - `pack`: human metadata like `slug`, `name`, and `family` - `version`: the executable version block - `tools`: optional pack-defined composed tools - `challenges`: the workload definitions - `input_sets`: the concrete runnable cases A pack becomes runnable through its `version` block. That block currently carries the load-bearing execution data: - `number`: the pack version number - `execution_mode`: `native` or `prompt_eval` - `tool_policy`: allowed tool kinds and runtime toggles - `filesystem`: optional filesystem constraints - `sandbox`: network, env, package, and template configuration - `evaluation_spec`: the scoring contract - `assets`: version-scoped files or artifact references ## Challenge, input set, case, and asset are different things These terms are easy to blur together. Do not blur them. - challenge pack: the entire versioned bundle - challenge: one task definition inside the bundle - input set: one named collection of runnable cases for that pack version - case: one concrete workload item tied to a challenge via `challenge_key` - asset: a file-like dependency declared by key and path, optionally backed by a stored artifact ID The bundle model in the repo uses `input_sets[].cases[]` as the main execution unit. A case can carry: - `payload` - structured `inputs` - structured `expectations` - `artifacts` - case-local `assets` That makes cases more expressive than a single flat prompt. They can reference files, expected outputs, and evaluator inputs without inventing an ad-hoc schema per benchmark. ## The evaluation spec is part of the pack, not global product config The current evaluation docs are explicit about this. The scoring contract lives inside the pack version’s manifest. That means the pack defines: - validator keys and types - metrics and collectors - runtime limits - pricing rows used for cost scoring - scorecard dimensions and normalization thresholds This matters because AgentClash needs scorecards to remain auditable. When a run is scored, the product can persist the exact `evaluation_spec_id` that was used. The publish response already returns that ID. 
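To make that bundle shape concrete, here is a minimal sketch of a `prompt_eval` pack that uses only the sections described above. All values (the slug, keys, instructions, and expectation text) are illustrative placeholders, and only the simplest evaluation spec options are shown; the challenge pack reference later in this bundle covers the full field list.

```yaml
# Illustrative prompt_eval bundle: one challenge, one input set, deterministic scoring.
# All values are placeholders; run `agentclash challenge-pack validate` on real bundles.
pack:
  slug: support-triage
  name: Support Triage
  family: support
  description: Classify inbound support tickets by priority.

version:
  number: 1
  execution_mode: prompt_eval   # prompt_eval packs may not declare tools, sandbox, or tool_policy
  evaluation_spec:
    judge_mode: deterministic
    validators:
      - key: priority_exact
        type: exact_match
        target: final_output
        expected_from: case.expectations.priority
    scorecard:
      strategy: weighted
      dimensions:
        - correctness             # plain-string shorthand for a built-in dimension

challenges:
  - key: triage-priority
    title: Assign a priority
    category: classification
    difficulty: easy
    instructions: Read the ticket and reply with exactly one priority label.

input_sets:
  - key: smoke
    name: Smoke cases
    cases:
      - challenge_key: triage-priority
        case_key: outage-report
        inputs:
          ticket: "Checkout is down for all users."
        expectations:
          priority: "P1"
```

Treat the field names as the stable part and the values as placeholders. `agentclash challenge-pack validate` remains the authority on whether a concrete bundle parses and passes publish-time checks.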
## Execution mode matters Two execution modes are visible in the current code and examples: - `prompt_eval`: lighter-weight packs that focus on prompt-style evaluation - `native`: packs that can carry sandbox, tool, and execution policy for richer runs You should choose the simpler mode unless the workload really needs a sandbox, files, or tool execution. ## Sandbox, tool policy, and internet access belong to the pack version This is one of the most important design choices in the repo. The pack version can say what the evaluator is allowed to do: - which tool kinds are allowed - whether shell or network access is enabled - what network CIDRs are allowed - which additional packages should exist in the sandbox - which env vars are injected as literal values In other words, the pack is not only content. It is also policy. ## Assets and artifact-backed packs The version block, challenge blocks, and case blocks can all reference assets. Each asset has a `key` and `path`, and may also carry `media_type`, `kind`, or `artifact_id`. That gives you two useful authoring patterns: - check small fixtures into the pack and refer to them by path - attach previously uploaded workspace artifacts and refer to them by `artifact_id` Validation already checks that asset references are real. If a case or expectation points at an artifact key that was never declared, publish-time validation fails. ## Publish and validate are first-class workflow steps The API and CLI already expose the authoring loop directly: - validate with `POST /v1/workspaces/{workspaceID}/challenge-packs/validate` - publish with `POST /v1/workspaces/{workspaceID}/challenge-packs` - list packs with `GET /v1/workspaces/{workspaceID}/challenge-packs` - list published input sets with `GET /v1/workspaces/{workspaceID}/challenge-pack-versions/{versionID}/input-sets` The CLI mirrors that with: ```bash agentclash challenge-pack validate agentclash challenge-pack publish agentclash challenge-pack list ``` Publish returns more than a pack ID. It returns: - `challenge_pack_id` - `challenge_pack_version_id` - `evaluation_spec_id` - `input_set_ids` - optional `bundle_artifact_id` That tells you the pack bundle is treated as a concrete artifact of record, not just transient YAML. ## How a pack becomes a run A run binds: - one `challenge_pack_version_id` - one or more deployment IDs - optionally one selected input set From there the worker resolves the pack manifest, execution policy, assets, and scoring contract into the runtime path. That is why challenge packs are foundational in AgentClash. They are the unit that makes two runs comparable without hand-waving. ## See also - [Write a Challenge Pack](../guides/write-a-challenge-pack) - [Agents and Deployments](../concepts/agents-and-deployments) - [Tools, Network, and Secrets](../concepts/tools-network-and-secrets) - [Artifacts](../concepts/artifacts) --- # Replay and Scorecards Understand how run events become a readable timeline, a defensible score, and a reusable evidence trail. Source: https://www.agentclash.dev/docs/concepts/replay-and-scorecards Markdown export: https://www.agentclash.dev/docs-md/concepts/replay-and-scorecards Replay is the ordered event history of a run. Scorecards are the condensed judgments and summaries built from that evidence. ## Why AgentClash stores both If you only keep a final score, you lose the explanation. If you only keep raw logs, nobody can compare anything quickly. AgentClash needs both because the product is about arguing from evidence, not from vibes. 
The canonical event envelope work in the repo makes that boundary explicit. Execution emits structured events. The frontend and downstream analysis layers can replay those events as a timeline. Scorecards then turn the same evidence into something compact enough to rank, filter, and compare. ## Replay is the source of truth Think of replay as the forensic record. It answers questions like: - what happened first - when the agent called a tool - when the sandbox or infrastructure layer failed - when artifacts or outputs were produced - what the final terminal state was That is why replay data should be preserved even when the top-line score looks obvious. The run may still teach you something the score alone cannot show. ## Scorecards are the decision layer A scorecard should make a run legible in seconds. The exact schema will keep evolving, but the purpose is stable: - summarize whether the run passed, failed, or degraded - attach the evidence that justifies that judgment - make comparisons across runs possible without rereading the full trace ## How to use both together The fastest useful workflow is: 1. start with the scorecard to see whether the run is healthy 2. move to the replay timeline to understand why 3. inspect artifacts when the failure is ambiguous or multi-step 4. compare against another run only after you trust the evidence on each side That sequence sounds basic, but it prevents a common failure mode: overreacting to a single score change without checking whether the underlying run actually exercised the same path. ## See also - [Interpret Results](../guides/interpret-results) - [Evidence Loop](../architecture/evidence-loop) - [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs) - [Data Model](../architecture/data-model) --- # Tools, Network, and Secrets Learn the difference between workspace tools, pack-defined tools, and engine primitives, and how network and secret handling are constrained. Source: https://www.agentclash.dev/docs/concepts/tools-network-and-secrets Markdown export: https://www.agentclash.dev/docs-md/concepts/tools-network-and-secrets AgentClash has more than one “tool” layer. If you do not separate them mentally, the rest of the runtime model gets confusing fast. ## There are three different layers to know ### 1. Workspace tool resources The workspace API already exposes first-class `tools` resources with fields like: - `name` - `tool_kind` - `capability_key` - `definition` - `lifecycle_status` These are infrastructure resources that live alongside runtime profiles, provider accounts, and model aliases. ### 2. Pack-defined composed tools Inside a challenge pack, the optional top-level `tools` block lets a pack author define custom tool interfaces that the evaluated agent can see. Those definitions are pack-local. They are part of the authored benchmark bundle. ### 3. Engine primitives At the bottom are the built-in executor primitives, like `http_request`. These are the concrete operations the runtime knows how to execute safely. A pack-defined tool can delegate to a primitive. That is the key distinction. 
## Primitive versus composed tool The current validation code expects composed tools to look roughly like this: ```yaml tools: custom: - name: check_inventory description: Check inventory by SKU parameters: type: object properties: sku: type: string implementation: primitive: http_request args: method: GET url: https://api.example.com/inventory/${sku} headers: Authorization: Bearer ${secrets.INVENTORY_API_KEY} ``` What this means: - `check_inventory` is the tool name the agent sees - `http_request` is the engine primitive that actually runs - `args` is the templated mapping from tool parameters to primitive inputs So when people ask “primitive tools vs actual tools,” the clean answer is: - primitives are built-in executor operations - composed tools are the author-defined tool contracts that delegate to those primitives - workspace tool resources are a separate infrastructure surface ## Validation is strict on purpose The current parser and tests already reject several dangerous or ambiguous cases: - unknown template placeholders like `${missing}` - self-referencing tools where a tool delegates to itself - delegation cycles across composed tools - invalid JSON-schema parameter definitions - missing primitive names or missing args blocks That strictness is good. A benchmark bundle should fail at publish time rather than fail mysteriously at run time. ## Tool kinds are a separate gate from tool names The sandbox policy also carries `allowed_tool_kinds`. That means the pack can say which broad categories are available, for example: - `file` - `shell` - `network` This is different from a specific composed-tool name. A pack might define `check_inventory`, but the runtime still checks whether the underlying kind is allowed. ## Internet access is not automatic The current runtime does not treat network as free ambient capability. There are at least three control points visible in the repo: - the sandbox/tool policy starts with network disabled by default - the pack can enable outbound networking through `sandbox.network_access` and related policy toggles - the `http_request` primitive validates the target URL and CIDR allowlist before making a request The current `http_request.py` helper does all of this: - allows only `http` and `https` - rejects missing hosts - resolves DNS and checks resolved addresses - blocks private, loopback, link-local, reserved, and multicast addresses unless explicitly allowlisted - enforces request and response body limits - sanitizes error handling so secret-bearing values do not leak back to the agent So the current answer to “how can you call the external internet?” is: - use a tool path that ultimately delegates to a network-capable primitive like `http_request` - enable network access in the pack/runtime policy - keep the destination within the permitted network rules ## Secrets live outside the pack The product already exposes workspace-scoped secrets as a first-class surface. You can: - list secret keys - set a secret value - delete a secret The list endpoint intentionally returns metadata only. Secret values never come back out. The CLI surface is: ```bash agentclash secret list agentclash secret set agentclash secret delete ``` ## Where secret references resolve There are two distinct secret-reference patterns in the current code: - `workspace-secret://KEY` for provider credential resolution - `${secrets.KEY}` inside composed-tool argument templates These are not interchangeable. 
`workspace-secret://KEY` is used when the provider layer resolves account credentials. `${secrets.KEY}` is used during composed-tool argument substitution. The engine then decides whether the target primitive is allowed to receive secret-bearing args. ## Only hardened primitives can accept `${secrets.*}` This is a security boundary, not a convenience feature. The current `primitive_secrets.go` file says only secret-safe primitives may receive `${secrets.*}` substitutions, and today that allowlist intentionally includes only `http_request`. The reason is straightforward: - secrets must not end up in argv - secrets must not land in readable sandbox files - secrets must not come back in response headers or stderr - secrets must not be echoed into the agent context accidentally That is also why sandbox `env_vars` are literal-only. The executor explicitly rejects `${...}` placeholders there, and the code comment tells pack authors to use `http_request` headers instead when remote authentication is needed. ## See also - [Write a Challenge Pack](../guides/write-a-challenge-pack) - [Configure Runtime Resources](../guides/configure-runtime-resources) - [Sandbox Layer](../architecture/sandbox-layer) - [Artifacts](../concepts/artifacts) --- # Artifacts Understand workspace artifacts, pack assets, run evidence files, and how downloads are signed and delivered. Source: https://www.agentclash.dev/docs/concepts/artifacts Markdown export: https://www.agentclash.dev/docs-md/concepts/artifacts An artifact is a stored file object that AgentClash can keep at workspace scope, attach to runs, reference from challenge packs, and expose through signed downloads. ## What an artifact is in the current product The current artifact response shape already tells you the core model: - `workspace_id` - optional `run_id` - optional `run_agent_id` - `artifact_type` - optional `content_type` - optional `size_bytes` - optional `checksum_sha256` - `visibility` - `metadata` - `created_at` That means artifacts are not only run outputs. They can exist before a run and be used as reusable workspace context. ## There are two important artifact roles ### 1. Workspace-managed files The workspace UI and API let you upload arbitrary files to the workspace artifact store. The current artifacts page describes them as files you can: - use as context in challenge packs - attach to runs That is the right mental model. Upload once, then reuse where it makes sense. ### 2. Run evidence files Runs and replay events can also point at artifacts. Those become part of the evidence trail for later inspection, scoring, or failure review. That is why replay and failure-review models carry artifact references. Artifacts are part of the audit trail, not just incidental attachments. ## Challenge-pack assets and artifact refs Challenge packs do not embed giant blobs directly into YAML. They declare assets and then refer to them by key. The bundle model supports assets at multiple levels: - `version.assets` - `challenge.assets` - `case.assets` Each asset can carry: - `key` - `path` - `kind` - `media_type` - optional `artifact_id` Then other parts of the pack can reference those declared assets using: - `artifact_refs` - `artifact_key` - expectation sources like `artifact:` Validation already checks that those references are real. If the key or artifact ID does not resolve, validation fails before publish. ## The published bundle is itself tracked as an artifact When you publish a challenge pack, the response may include `bundle_artifact_id`. 
That is an important detail because it means the authored pack bundle is treated as a stored object of record. The product does not only store parsed rows; it can also retain the published source bundle as an artifact. ## Upload and download flow The current API surface is: - `GET /v1/workspaces/{workspaceID}/artifacts` - `POST /v1/workspaces/{workspaceID}/artifacts` - `GET /v1/artifacts/{artifactID}/download` - public content route at `/artifacts/{artifactID}/content` The upload path is multipart and supports: - `file` - `artifact_type` - optional `run_id` - optional `run_agent_id` - optional `metadata` The download flow is intentionally indirect. The API returns a signed URL and expiry, then the actual file content is served through the public content endpoint. That keeps raw artifact content behind signed access rather than exposing direct permanent object URLs. ## Visibility and metadata matter Artifacts also carry `visibility` and arbitrary JSON `metadata`. The current UI uses metadata to recover nicer names like `original_filename`. If no filename metadata exists, it falls back to showing the artifact ID prefix. That sounds minor, but it is a sign that metadata is a first-class part of the artifact model, not just a debug dump. ## When to use artifacts versus inline YAML data Use inline bundle data when: - the value is small - it belongs directly in the challenge definition - you want the pack to stay self-contained Use artifacts when: - the file is large or binary - you want reuse across packs or runs - the same file should be downloadable later - the evidence trail should preserve it as a named object ## See also - [Challenge Packs and Inputs](../concepts/challenge-packs-and-inputs) - [Write a Challenge Pack](../guides/write-a-challenge-pack) - [Evidence Loop](../architecture/evidence-loop) - [Data Model](../architecture/data-model) --- # Challenge pack documentation Deep, consumer-facing YAML and runtime reference keyed to the parsers, validators, and workers in this repository. Source: https://www.agentclash.dev/docs/challenge-packs Markdown export: https://www.agentclash.dev/docs-md/challenge-packs These pages complement the short concept guide [Challenge packs and inputs](../concepts/challenge-packs-and-inputs). They spell out everything a benchmark author needs to publish a pack that survives server-side parsing, validation, and execution. Everything here is keyed to shipped code paths—not roadmap language. 
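Before diving into the individual pages, it helps to see the authoring loop they support end to end. The sketch below reuses the CLI commands from the hosted quickstart; the filename is a placeholder, and it assumes a linked workspace that already has at least one agent deployment.

```bash
# Illustrative authoring loop; commands are from the hosted quickstart, the filename is a placeholder.
agentclash challenge-pack init my-pack.yaml       # scaffold a starter bundle
agentclash challenge-pack validate my-pack.yaml   # server-side parse and validation, no publish
agentclash challenge-pack publish my-pack.yaml    # returns pack, version, evaluation spec, and input set IDs
agentclash eval start --follow                    # create a run against the published version and stream it
agentclash baseline set                           # bookmark the run as the baseline for later comparisons
agentclash eval scorecard                         # inspect the scored result
```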
When behavior changes upstream, validate again with: ```bash agentclash challenge-pack validate your-pack.yaml ``` ## What's covered | Topic | Use when you… | Anchor in repo | | --- | --- | --- | | [Bundle YAML reference](bundle-yaml-reference) | Need the authoritative field list and `prompt_eval` vs `native` rules | `backend/internal/challengepack/bundle.go`, `validation.go` | | [Evaluation spec reference](evaluation-spec-reference) | Choose validator types, wire `target`/`expected_from`, add metrics | `backend/internal/scoring/spec.go`, `validation.go`, `engine_*.go` | | [LLM judges](llm-judges) | Add rubrics, assertions, pairwise comparison, budgets | `backend/internal/scoring/spec.go`, `validation_judges.go` | | [Tools, primitives & policy](tools-primitives-and-policy) | Decide `allowed_tool_kinds`, map composed tools → primitives | `backend/internal/engine/primitive_tools.go`, `tool_registry.go`, `sandbox/sandbox.go` | | [Sandbox & E2B](sandbox-and-e2b) | Tune network_allowlist, template id, sandbox provider | `backend/internal/challengepack/bundle.go`, `sandbox/e2b/`, worker config | | [Input sets & cases](input-sets-and-cases) | Model fixtures, typed inputs and expectations | `challengepack/bundle.go` (`CaseDefinition`), `StoredCaseDocument` | | [Eval workflows & gates](eval-workflows-and-gates) | Chain `eval start`, baselines, scorecards, comparisons | `cli/cmd/eval.go`, `baseline.go`, `compare.go` | ## See also - [Write a challenge pack](../guides/write-a-challenge-pack) — minimal happy-path checklist - [Tools, network, and secrets](../concepts/tools-network-and-secrets) — mental model overview - [Sandbox layer](../architecture/sandbox-layer) — provider boundary explanation --- # Bundle YAML reference Structured reference for challenge-pack bundles as parsed by backend/internal/challengepack. Source: https://www.agentclash.dev/docs/challenge-packs/bundle-yaml-reference Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/bundle-yaml-reference AgentClash challenge packs ship as **one YAML file** decoded into `challengepack.Bundle` (`backend/internal/challengepack/bundle.go`). Publication stores a JSON manifest combining the pack body with metadata for execution and scoring (`ManifestJSON` in the same file). ## Top-level keys All keys listed here are persisted or validated somewhere in publish/validate—not decorative. ### `pack` (required) Human metadata surfaced in API and CLI lists: - `slug` — required, workspace-unique branding string - `name` — required display name - `family` — required grouping/category string - `description` — optional ### `version` (required) The executable spine of the bundle: | Field | Notes | | --- | --- | | `number` | Required positive integer (`int32`). Bumps when you materially change validators, tooling, sandbox, scores, etc. | | `execution_mode` | `native` **or** `prompt_eval`. An empty value is accepted for legacy packs, but you should always set it explicitly. See execution rules below. | | `tool_policy` | Arbitrary-shaped map mirrored into manifest `tool_policy`. **Must be omitted** when `execution_mode` is `prompt_eval` (validated in `ValidateBundle`). | | `filesystem` | Optional filesystem constraints blob (same manifest path as today). | | `sandbox` | Optional `SandboxConfig`. **Forbidden** when `execution_mode` is `prompt_eval`. | | `evaluation_spec` | Required scoring contract unmarshalled through scoring package strict decode (`scoring.StrictDecodeEvaluationSpec`). Typos surface at parse time rather than silently defaulting.
| | `assets` | Version-scoped `AssetReference[]` keyed for cases and uploads. | `SandboxConfig` (`bundle.go`) currently supports: - `network_access` (bool) - `network_allowlist` (CIDR strings; invalid CIDR rejects publish) - `env_vars` (map of string literals—see [Sandbox & E2B](sandbox-and-e2b)) - `additional_packages` (APT-style names constrained by regexp in validation) - `sandbox_template_id` (optional provider template override) Manifest JSON merges `sandbox_template_id` from `version.sandbox` into the serialized `version` block for backends that historically keyed off that field separately. ### `tools` (optional) Optional map keyed by integration style; authoring today uses **`tools.custom`** as an array of composed tools (`validation.go`). **Must be empty** when mode is `prompt_eval`. See [Tools, primitives & policy](tools-primitives-and-policy). ### `challenges` (required, non-empty) Each challenge includes: | Field | Required | Purpose | | --- | --- | --- | | `key` | yes | Stable id referenced by cases | | `title` | yes | Display | | `category` | yes | Stored metadata | | `difficulty` | yes | Stored metadata (`easy`/`medium`/… as free text unless your org standardizes it) | | `instructions` | often | Prompt body; mirrored into `definition.instructions` if missing there | | `definition` | optional | Extra JSON-compatible bag for product-specific authoring | | `assets` | optional | Challenge-scoped files | | `artifact_refs` | optional | Artifact key references validated against declared artifacts | ### `input_sets` (required) Defines runnable **cases**: - **`key`** and **`name`** required on each set. - Prefer modern **`cases`** array. Legacy `items` aliases are normalized into `cases` during `normalizeBundle`; do not rely on `items` in new authoring. - All cases in a single input set must reference the same `challenge_key`; use separate input sets for separate challenges. Cases are modeled by `CaseDefinition` with `challenge_key`, `case_key`, optional rich `inputs`/`expectations`, `artifacts`, `assets`, plus legacy **`payload`** for blob-only authoring. Deep dive: [Input sets & cases](input-sets-and-cases). ## Execution mode compatibility Validated in `ValidateBundle`: When `execution_mode` is **`prompt_eval`**: - `tools` block must **not** be present - `version.sandbox` must **not** be present - `version.tool_policy` must be **empty** When **`native`** you may populate sandbox, allowed tool kinds, and composed tools—the worker will hydrate the richer runtime path (`native_executor` flow). Choose `prompt_eval` for pure model-output workloads; promote to `native` when you rely on sandbox files, primitives, validators that read captured files (`file:` evidence), etc. ## After publish — IDs returned Publishing returns stable identifiers referenced by runs (see authoring guide): - `challenge_pack_id` - `challenge_pack_version_id` - `evaluation_spec_id` - `input_set_ids` - Optional `bundle_artifact_id` Run creation binds `challenge_pack_version_id` specifically—YAML filenames stop mattering immediately after publish. ## See also - [Evaluation spec reference](evaluation-spec-reference) - [Challenge packs and inputs — conceptual overview](../concepts/challenge-packs-and-inputs) --- # Evaluation spec reference Validators, metrics, behavioral signals, runtime limits, and scorecard semantics from backend/internal/scoring. 
Source: https://www.agentclash.dev/docs/challenge-packs/evaluation-spec-reference Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/evaluation-spec-reference Pack-local `evaluation_spec` is unmarshalled into `scoring.EvaluationSpec` (`backend/internal/scoring/spec.go`) via strict decoding. This page maps fields to enums and collectors actually implemented—not aspirational placeholders. For LLM graders see the dedicated guide: [LLM judges](llm-judges). ## Judge mode (`judge_mode`) Top-level discriminator on how scoring composes deterministic pieces with judges: | Value | Constant | | --- | --- | | `deterministic` | `JudgeModeDeterministic` | | `llm_judge` | `JudgeModeLLMJudge` | | `hybrid` | `JudgeModeHybrid` | Invalid values fail validation early. ## Validators (`validators[]`) Each entry is a `ValidatorDeclaration`: - **`key`** — unique within spec; also forbidden to collide with metrics or judges - **`type`** — enumerated `ValidatorType` (list below) - **`target`** — evidence reference (validators require supported references—see Evidence references section) - **`expected_from`** — often required depending on validator type (`RequiresExpectedFrom` in `spec.go`) - **`config`** — type-specific strict JSON validated in `validation.go` ### Implemented validator types From `ValidatorType*` constants: `exact_match`, `contains`, `regex_match`, `json_schema`, `json_path_match`, `boolean_assert`, `fuzzy_match`, `numeric_match`, `normalized_match`, `token_f1`, `math_equivalence`, `bleu_score`, `rouge_score`, `chrf_score`, `file_content_match`, `file_exists`, `file_json_schema`, `directory_structure`, `code_execution` File-ish validators gate on sandbox artifacts (see **File validators**: `IsFileValidator()` distinguishes these). Always check `requires_expected_from`: e.g., `file_exists`, `directory_structure`, and `code_execution` can rely on config/paths without `expected_from`. ## Metrics (`metrics[]`) `MetricDeclaration` requires: | Field | Notes | | --- | --- | | `key` | Unique within spec | | `type` | `numeric`, `text`, or `boolean` | | `collector` | Implemented switch in `engine_metrics.go` | | `unit` | Stored for dashboards/score normalization | Collectors wired today (verbatim keys): `run_total_latency_ms`, `run_ttft_ms`, `run_input_tokens`, `run_output_tokens`, `run_total_tokens`, `run_agent_tokens`, `run_race_context_tokens`, `run_model_cost_usd`, `run_completed_successfully`, `run_failure_count`, `run_tool_call_count`, behavioral scores (`behavioral_recovery_score`, … ), `validator_pass_rate` Declaring a collector that does not exist can fail silently when its evidence is missing—prefer copying keys from the tests in `backend/internal/scoring/engine_metrics.go`. ## Behavioral panel (`behavioral`) Optional `behavioral.signals[]` referencing `behavioral.signal` enums: - `recovery_behavior` - `exploration_efficiency` - `error_cascade` - `scope_adherence` - `confidence_calibration` Each signal supports `weight`, plus optional `gate` and `pass_threshold` for hardened evaluation sessions. ## Post-execution sandbox captures (`post_execution_checks`) Declare file/directory grabs before sandbox teardown (`post_execution.go`): | `type` | Meaning | | --- | --- | | `file_capture` | Persist file bytes up to configured max | | `directory_listing` | Snapshot structure | Captured evidence is exposed to graders through `file:` style references downstream—pair with validators that target those artifacts.
Defaults: ~1 MiB per file, aggregate caps enforced per run (`DefaultMaxFileSizeBytes`, `DefaultMaxTotalCaptureBytes`). ## Scorecard (`scorecard`) `ScorecardDeclaration` holds: ### Dimensions (`dimensions`) Each dimension may be a plain string shorthand (historical compatibility) **or** an expanded object specifying: - **`key`** — dimension name (`correctness`, `latency`, `cost`, `behavioral`, custom) - **`source`** — dispatcher: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, `llm_judge` - **`validators[]`**, **`metric`**, **`judge_key`** — depending on `source` - **`weight`**, **`normalization`** — linear normalize against target/max envelopes - **`gate`**, **`pass_threshold`** — hard fail semantics (see Strategies) Built-in shortcut keys normalize during `normalizeEvaluationSpec` (`validation.go`): correctness / reliability / latency / cost / behavioral auto-fill sensible sources. ### Strategy (`strategy`) | Strategy | Semantics sketch | | --- | --- | | `weighted` | Weighted mean; gated dims may still veto pass verdict | | `binary` | All dimensions treated as gates; scorecard-level `pass_threshold` is rejected (prevents ambiguity) | | `hybrid` | Gates AND aggregate over non-gate dims must clear optional `scorecard.pass_threshold` | See doc comments on `ScoringStrategy` in `spec.go` for nuanced behavior—especially hybrid vs weighted gate interplay. ### `scorecard.pass_threshold` Optional inclusive overall score cutoff (documented extensively in struct comment). Forbidden for pure `binary`. ### Judge budgets (`scorecard.judge_limits`) Caps LLM-as-judge spend per run (`MaxSamplesPerJudge`, `MaxCallsUSD`, `MaxTokens`). Hard-coded ceilings (`JudgeMaxSamplesCeiling`) still clamp pack-authored overrides. ### Legacy normalization (`normalization`) `latency.target_ms`, `cost.target_usd` migrate into dimension-level normalization automatically for older specs—still accepted. ## Runtime limits (`runtime_limits`) `max_total_tokens`, `max_cost_usd`, `max_duration_ms`—enforced upstream of sandbox/model loops; surfaced for UI + scoring fallbacks. ## Pricing (`pricing.models[]`) Pricing rows describe per-million-token economics for **`run_model_cost_usd`** normalization. Rows should match the `ProviderKey`/`ProviderModelID` tuples your workspace deployments actually use—misaligned rows produce weak cost dims but do not invalidate the pack. ## Evidence references validators understand Validated by `isSupportedEvidenceReference`: - Absolute shortcuts: `final_output`, `run.final_output`, `challenge_input`, `case.payload` - Dotted accessors: `case.payload.*`, `case.inputs.*`, `case.expectations.*`, `artifact.*` - Sandbox artifacts: prefix `file:` with non-empty remainder - Literals: `literal:…` Prefer explicit paths whenever you refactor input schema—ambiguous references fail validation instead of drifting silently. ## See also - [LLM judges](llm-judges) - [Write a challenge pack](../guides/write-a-challenge-pack) - Historical v0 evaluation contract notes live in the monorepo file `docs/evaluation/challenge-pack-v0.md` (developer-oriented, not mirrored on the docs site). --- # LLM judges LLM-as-judge declarations, rubric/assertion modes, consensus, budgets, evidence wiring—straight from scoring/spec.go and validation_judges.go.
Source: https://www.agentclash.dev/docs/challenge-packs/llm-judges
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/llm-judges

Agents can be scored by **deterministic validators** alone, pure **LLM judges**, or **`hybrid`** combining both—the tri-state lives in `judge_mode` on `EvaluationSpec` (`backend/internal/scoring/spec.go`). Judge bodies live in **`evaluation_spec.llm_judges[]`** (`LLMJudgeDeclaration`). Dimensions that consume judges set `source: llm_judge` and a **`judge_key`** referencing exactly one declaration (a 1:1 mapping by design).

## Supported grader modes (`mode`)

| Mode | Typical use |
| --- | --- |
| `rubric` | Structured numeric rubric graded on each sample |
| `assertion` | Yes/no factual checks; aggregates via majority/unanimous |
| `n_wise` | Single prompt ranks all competing agents simultaneously |
| `reference` | Rubric calibrated against gold text from resolved evidence |

The `IsNumeric` / `IsBooleanScope` helpers govern which consensus math applies.

## Required fields by mode

Validation (`validation_judges.go`) enforces:

- **`rubric`** — non-empty rubric string
- **`reference`** — rubric + `reference_from` evidence reference (must pass `isSupportedEvidenceReference`)
- **`assertion`** — non-empty natural-language assertion; an optional `expect` bool flips the desired polarity
- **`n_wise`** — non-empty ranking `prompt`; optional `position_debiasing` combats ordering bias across samples

## Model fan-out

Exactly **one** of:

- `model` — single model id string (resolved by worker/provider wiring)
- `models` — non-empty list for multi-model judging

If `len(models) > 1`, you must include **`consensus`** with:

- `aggregation` — `median`, `mean`, `majority_vote`, or `unanimous`
- Optional `min_agreement_threshold`, `flag_on_disagreement`

Boolean-scope modes restrict some aggregations (assertion results cannot be meaningfully mean-averaged—the validator enforces compatibility).

## Samples & ceilings

- `samples` — per-model repeat count; `0` normalizes to `JudgeDefaultSamples` (3)
- Hard cap `JudgeMaxSamplesCeiling` (10) applied even if the pack requests more—a cost-attack guard

## Evidence conditioning (`context_from[]`)

Each entry must be a supported evidence reference (the same family as validator `target` strings). The workflow evaluator stitches these fragments into the judge envelope before the LLM call.

## Optional controls

| Field | Role |
| --- | --- |
| `output_schema` | JSON Schema for parser validation of model output |
| `score_scale` | `{min,max}` normalization (defaults to 1..5 when omitted) |
| `anti_gaming_clauses` | Pack-supplied safety copy **appended** to defaults (never replaces base mitigations) |
| `timeout_ms` | Per-judge activity budget (clamped by the outer Temporal activity timeout) |

## Scorecard wiring

1. Declare judges under `llm_judges`.
2. Add a dimension with `source: llm_judge` and a `judge_key` matching a judge `key`.
3. Keep **keys unique across validators, metrics, and judges**—collisions are validation errors (the shared namespace prevents ambiguous evidence routing).

## Budgets & cost isolation

`scorecard.judge_limits` tracks **judge** spend separately from the agent model spend covered by `runtime_limits`. This split is intentional (see the Q7 discussion embedded in the `JudgeLimits` comments in `spec.go`): agent overages should not hide judge runaway. When cumulative judge calls exceed the configured USD/token budgets, remaining samples downgrade to `unable_to_judge` states feeding scorecard `OutputStateUnavailable` paths.
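A hedged sketch of how these judge fields compose: one rubric judge with two-model consensus, wired to a scorecard dimension via `judge_key`. The keys, the rubric text, and the placeholder model ids are illustrative; real model ids depend on your workspace deployments.

```yaml
evaluation_spec:
  judge_mode: hybrid
  llm_judges:
    - key: answer_quality                      # referenced 1:1 by judge_key below
      mode: rubric
      rubric: |
        Score the final answer from 1 to 5 for factual accuracy and completeness.
      models: [judge-model-a, judge-model-b]   # placeholder ids; >1 model requires consensus
      consensus:
        aggregation: median
        flag_on_disagreement: true
      samples: 3                               # 0 would normalize to JudgeDefaultSamples
      context_from:
        - final_output                         # supported evidence references
        - case.inputs.question                 # assumed input key, shown for illustration
      score_scale: { min: 1, max: 5 }
  scorecard:
    dimensions:
      - key: answer_quality_dim
        source: llm_judge
        judge_key: answer_quality
        weight: 0.5
```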
## Practical authoring tips

- Start with **`rubric`** + single `model` + default samples; add `models` + `consensus` only after deterministic dims stabilise.
- Use **`reference`** when you already store golden answers in `case.expectations` or artifacts—keeps judges aligned to ground truth.
- Assertions excel as **binary gates** (`gate: true` on the dimension) while numeric rubrics express partial credit.

## See also

- [Evaluation spec reference](evaluation-spec-reference)
- [Interpret results](../guides/interpret-results)

---

# Tools, primitives & policy

How tool_policy and tools.custom map to engine primitives in backend/internal/engine and sandbox.ToolPolicy.

Source: https://www.agentclash.dev/docs/challenge-packs/tools-primitives-and-policy
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/tools-primitives-and-policy

AgentClash stacks **three** tool notions (also summarized in [Tools, network, and secrets](../concepts/tools-network-and-secrets)):

1. **Workspace tool resources** — org-level infrastructure objects (not covered by pack YAML)
2. **Pack composed tools** — `tools.custom[]` entries expanding to JSON Schema + implementation
3. **Engine primitives** — concrete executors registered in `nativePrimitiveTools` (`backend/internal/engine/primitive_tools.go`)

Only (2)+(3) are pack-controlled.

## Tool policy shape

`version.tool_policy` JSON eventually hydrates `sandbox.ToolPolicy` (`backend/internal/sandbox/sandbox.go`):

- **`allowed_tool_kinds`** — list controlling capability groups
- **`allow_shell`** — separate bool gating the `exec` primitive

### Recognized kind strings

Validated set (`supportedToolKinds` in `challengepack/validation.go`): `browser`, `build`, `data`, `file`, `network`

**Shell is not a kind**—enable it with `allow_shell: true`.

### Empty allowlist semantics

`allowsToolKind` treats an **empty** `allowed_tool_kinds` as "allow everything" (per `primitive_helpers.go`). In practice, prefer explicit lists so validation errors catch typos early.

### Mode guardrails

`prompt_eval` packs **must omit** `tool_policy` entirely—see [Bundle YAML reference](bundle-yaml-reference).

## Built-in primitive names

Declared in `executor_builders.go`, registered in `nativePrimitiveTools`:

| Primitive | Gated by |
| --- | --- |
| `submit` | Always available (final answer) |
| `read_file`, `write_file`, `list_files`, `search_files`, `search_text` | `file` kind |
| `query_json`, `query_sql` | `data` kind |
| `http_request` | `network` kind (+ runtime network flags) |
| `run_tests`, `build` | `build` kind |
| `exec` | `allow_shell` |

Browser tooling exists in policy (`toolKindBrowser`)—ensure your template + worker build includes whatever browser bridge your pack expects before relying on it in production.
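For orientation, a minimal `tool_policy` sketch that enables the file and network capability groups plus the shell primitive. The nesting below mirrors the `version.tool_policy` path used above; check the [Bundle YAML reference](bundle-yaml-reference) for the authoritative field placement.

```yaml
version:
  tool_policy:
    allowed_tool_kinds: [file, network]   # explicit list; an empty list means "allow everything"
    allow_shell: true                     # separately gates the exec primitive; shell is not a kind
```

With this policy, the model would see the `read_file` family, `http_request`, and `exec`, while `query_sql`, `run_tests`, and browser tooling stay out of the provider-visible registry.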
## Composed tools (`tools.custom[]`)

Each item:

```yaml
tools:
  custom:
    - name: call_support_api
      description: Fetch ticket JSON
      parameters:
        type: object
        properties:
          ticket_id: { type: string }
        required: [ticket_id]
        additionalProperties: false
      implementation:
        primitive: http_request
        args:
          method: GET
          url: https://api.example.com/tickets/${ticket_id}
          headers:
            Authorization: Bearer ${secrets.SUPPORT_TOKEN}
```

Validation highlights (`validateComposedToolConfig`):

- Non-mock tools require **`implementation.primitive`** not equal to the composed name (prevents self-delegation footgun)
- **`implementation.args`** object required; templates validated for placeholder safety
- Parameters must be JSON Schema passing `templateutil.ValidateToolParameterSchema`
- Custom graph cannot contain **cycles** or depth > 8 delegation jumps

### Mock implementations

Set `implementation.type: mock` to skip primitive resolution—useful for dry-run packs or policy-only testing. Mocks bypass cycle detection.

### Workspace tools vs pack tools

Pack tools are **not** the same records as API `tools` resources—they are bundle-local contracts interpreted entirely inside the worker.

## Secret placeholders

Composed `args` may reference `${secrets.NAME}` which resolve through workspace secret stores—**never** place secret material inline. Sandbox `env_vars` explicitly reject secret placeholders (see native executor sandbox guard) because environment leaks are too easy; prefer header injection on `http_request`.

## Provider visibility

`buildToolRegistry` lifts final OpenAI/Anthropic/etc. tool definitions from the registry's **visible** map—only tools allowed by policy + manifest appear to the model.

## See also

- [Sandbox & E2B](sandbox-and-e2b) for network pairing with `http_request`
- `backend/internal/challengepack/tools_validation_test.go` for edge-case fixtures

---

# Sandbox & E2B

Pack sandbox fields, worker sandbox provider selection, and how native execution reaches E2B today.

Source: https://www.agentclash.dev/docs/challenge-packs/sandbox-and-e2b
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/sandbox-and-e2b

Native runs execute agent tool calls inside an isolated **sandbox provider** implementing `sandbox.Provider` (`backend/internal/sandbox/sandbox.go`). Production commonly uses **E2B** (`backend/internal/sandbox/e2b/provider.go`); local development may run **unconfigured** no-op providers so queues drain without real VMs.

## Pack-level `version.sandbox`

Struct `SandboxConfig` (`challengepack/bundle.go`):

| Field | Purpose |
| --- | --- |
| `network_access` | Boolean gate paired with tool policy network tools |
| `network_allowlist` | CIDR strings; invalid entries fail `ValidateBundle` |
| `env_vars` | Literal environment injection into sandbox (secret placeholders rejected during native executor setup) |
| `additional_packages` | APT package names (`aptPackagePattern` validation) |
| `sandbox_template_id` | Optional override of default E2B template id per pack version |

**Remember:** entire `sandbox` block is illegal for `prompt_eval` packs.
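Below is a sketch of a native pack's `version.sandbox` block using the fields in the table above; the allowlist range, the environment map shape, the package name, and the template id are placeholders and assumptions rather than defaults.

```yaml
version:
  sandbox:
    network_access: true                  # pair with allowed_tool_kinds: [network] in tool_policy
    network_allowlist:
      - 203.0.113.0/24                    # CIDR strings; invalid entries fail ValidateBundle
    env_vars:
      APP_ENV: ci                         # literal values only; ${secrets.*} placeholders are rejected
    additional_packages:
      - jq                                # APT package names, validated by aptPackagePattern
    sandbox_template_id: example-template-id   # optional per-version E2B template override
```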
## Worker configuration knobs

From the environment / `backend/internal/worker/config.go` (a mirror of the searchable [Config reference](../reference/config) tables):

| Variable | Effect |
| --- | --- |
| `SANDBOX_PROVIDER` | `e2b` vs `unconfigured` (noop) |
| `E2B_API_KEY` | Credentials |
| `E2B_TEMPLATE_ID` | Default template when the pack omits `sandbox_template_id` |
| `E2B_API_BASE_URL` | Optional API override |
| `E2B_REQUEST_TIMEOUT` | HTTP budget for control-plane calls |

Misconfiguration does not rewrite your YAML—it causes clear worker errors or `/doctor` warnings when local sandboxes cannot start.

## Tool policy vs network flags

Even if `http_request` is allowed by `allowed_tool_kinds: [network]`, outbound traffic still respects:

- global sandbox network toggles
- CIDR allowlists
- provider-level enforcement inside E2B machines

Think of **tool policy** as "the model may ask" and **sandbox** as "the infrastructure may permit".

## Secrets & environment

The native executor refuses `${secrets.*}` inside `sandbox.env_vars` because files and process listings could leak them; keep secrets in tool args that go through hardened paths (notably `http_request` header sanitation—see the `primitive_secrets.go` comments).

## Failure modes to expect

- **Template drift** — changing `additional_packages` without rebuilding templates can cause first-run apt noise; pin templates once stable.
- **Allowlist too tight** — the model receives policy errors from `http_request` if DNS resolves but the CIDR allowlist blocks egress.
- **No provider** — `SANDBOX_PROVIDER=unconfigured` means native runs **do not** execute real tools; useful for API-only integration tests, but misleading if you expect live sandboxes.

## See also

- [Architecture — Sandbox layer](../architecture/sandbox-layer)
- [Tools, primitives & policy](tools-primitives-and-policy)

---

# Input sets & cases

How cases bind challenges, structured inputs, expectations, assets, and legacy payloads—grounded in `challengepack.CaseDefinition`.

Source: https://www.agentclash.dev/docs/challenge-packs/input-sets-and-cases
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/input-sets-and-cases

Input sets are the unit AgentClash schedules per deployment/candidate. Each `input_sets[]` entry contains **`cases[]`** (`CaseDefinition` in `backend/internal/challengepack/bundle.go`).

## Case identity

- **`challenge_key`** — must reference an existing `challenges[].key`
- **`case_key`** / legacy **`item_key`** — both accepted; normalization duplicates the missing side from the other

All cases in one `input_sets[]` entry must reference the same `challenge_key`; split mixed-challenge suites into separate input sets. `EffectiveKey()` chooses `case_key` when present for stored rows.

## Three authoring styles (coexist)

1. **Legacy payload-only** — fill the `payload` map; omit structured inputs/expectations
2. **Structured eval** — `inputs[]` + `expectations[]` with explicit `kind` fields
3. **Artifact heavy** — `assets[]` + `artifacts[]` referencing declared version/challenge assets

`IsLegacyPayloadOnly` detects style (1) for storage compatibility.

### Stored document shape

When modern fields exist, `StoredPayload()` marshals a `StoredCaseDocument` JSON with `schema_version: 1`, preserving:

- `payload`
- `inputs`
- `expectations`
- `artifacts`
- `assets`

This is what scoring + replay pull back—not the raw YAML fragment.
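To ground the structured authoring style, here is a hedged sketch of one input set containing a single case with `inputs[]` and `expectations[]` (the fields described next). The set-level `key`, the `kind` values, and the sample text are illustrative; the `kind` vocabulary should match what your worker expects.

```yaml
input_sets:
  - key: smoke                            # assumed set-level identifier, for illustration
    description: One-case smoke suite
    cases:
      - challenge_key: summarize_ticket   # must match an existing challenges[].key
        case_key: ticket-001
        inputs:
          - key: ticket_body
            kind: text
            value: "Customer reports a login loop after a password reset."
        expectations:
          - key: answer
            kind: text
            value: "Mentions the password-reset loop and proposes invalidating stale sessions."
```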
## Case inputs (`inputs[]`)

`CaseInput` fields:

| Field | Role |
| --- | --- |
| `key` | Stable id for templates / UI |
| `kind` | Drives rendering + validator binding (`text`, `artifact`, etc.—product-specific kinds should match worker expectations) |
| `value` | Inline scalar/object |
| `artifact_key` | Pull bytes from the declared asset map |
| `path` | Optional relative path inside the asset bundle |

Validators can address these values through `case.inputs.*` evidence paths.

## Expectations (`expectations[]`)

`CaseExpectation` parallels inputs:

- `key`, `kind`, `value`, `artifact_key`, plus **`source`** telling graders where dynamic gold values originate (the `input:prompt` pattern seen in CLI template packs)

Use expectations for:

- deterministic string compares
- supplying LLM judge `reference_from` bindings
- filesystem validators comparing outputs to expected files

## Assets on cases

Case-level `assets[]` references use the same `AssetReference` structure as version-level entries (key, path, optional `artifact_id`). Validation ensures cross-references exist before publish succeeds.

## Input set metadata

An optional `description` on an input set is preserved for UI/discovery; there is no behavioral magic—selection happens by id/key at run creation time.

## Choosing an input set at run time

The CLI `eval start` accepts `--input-set` when multiple sets exist; otherwise, interactive TTY flows prompt for one. API consumers pass the chosen `input_set_id` when creating runs (see the OpenAPI `CreateRun` family).

## See also

- [Bundle YAML reference](bundle-yaml-reference)
- [Evaluation spec — evidence references](evaluation-spec-reference)
- [Artifacts concept](../concepts/artifacts)

---

# Eval workflows & gates

CLI-first eval commands, baselines, scorecards, comparisons, and release gates as implemented in `cli/cmd`.

Source: https://www.agentclash.dev/docs/challenge-packs/eval-workflows-and-gates
Markdown export: https://www.agentclash.dev/docs-md/challenge-packs/eval-workflows-and-gates

Challenge packs are useless until a **run** binds a `challenge_pack_version_id` to one or more **deployments**. The product ships a workflow-oriented CLI path so you rarely hand-copy UUIDs.

## Happy path commands

From `cli/cmd/eval.go`, `baseline.go`, `compare.go`, `release_gate.go`:

```bash
agentclash eval start --follow
agentclash baseline set [run_id] [--agent