AgentClash Documentation

Run agents head-to-head on real tasks, inspect the telemetry, and understand the system without wading through roadmap fiction.

AgentClash runs agents against the same task, with the same tools and time budget, then shows you who finished, who stalled, and where the run broke.
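As a rough illustration of the idea above, the sketch below runs several toy agents on one task under one shared budget and labels each as finished, stalled, or broke. It is not AgentClash code: `RunResult`, `run_agent`, `run_head_to_head`, and the three toy agents are hypothetical names, and a step count stands in for the time budget so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

# Hypothetical agent shape: a generator function that yields once per step
# and returns when it considers the task done.
AgentFn = Callable[[str], Iterator[None]]


@dataclass
class RunResult:
    agent: str
    status: str  # "finished", "stalled", or "broke"
    steps: int
    detail: str = ""


def run_agent(name: str, agent_fn: AgentFn, task: str, max_steps: int) -> RunResult:
    """Step one agent until it finishes, exhausts the budget, or raises."""
    steps = 0
    try:
        for _ in agent_fn(task):
            if steps == max_steps:
                # The agent asked for one more step than the budget allows.
                return RunResult(name, "stalled", steps, "step budget exhausted")
            steps += 1
    except Exception as exc:
        # Record where the run broke instead of aborting the whole comparison.
        return RunResult(name, "broke", steps, f"{type(exc).__name__}: {exc}")
    return RunResult(name, "finished", steps)


def run_head_to_head(agents: Dict[str, AgentFn], task: str, max_steps: int) -> List[RunResult]:
    """Give every agent the same task and the same budget."""
    return [run_agent(name, fn, task, max_steps) for name, fn in agents.items()]


# Toy agents used only for illustration.
def quick_agent(task: str) -> Iterator[None]:
    for _ in range(2):
        yield


def looping_agent(task: str) -> Iterator[None]:
    while True:
        yield


def crashing_agent(task: str) -> Iterator[None]:
    yield
    raise RuntimeError("tool call failed")


if __name__ == "__main__":
    agents = {"quick": quick_agent, "looping": looping_agent, "crashing": crashing_agent}
    for result in run_head_to_head(agents, "fix the failing test", max_steps=3):
        print(result)
```

Because every agent receives the same `task` and `max_steps`, differences in the results reflect the agents rather than the setup.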

These docs are layered for three kinds of readers:

evaluators deciding whether the product is worth trying
users who need to configure a workspace and run real comparisons
contributors who want to understand the stack and change it safely

The current public surface covers behavior visible in the repo today: the CLI, the local stack, datasets and regression gates, multi-turn human takeover, security stress harnesses, and the main runtime components.

Start with the hosted quickstart if you want the shortest path to a real command sequence. Start with self-host if you want the full local stack on your machine. Start with architecture if you are here to hack on the code. For challenge pack YAML, scoring, tooling, sandboxes, judges, and CLI eval flows, start at the Challenge pack reference.

New surfaces

Datasets overview — pinned evals, baselines, and CI gates
Multi-turn packs — human takeover and calibration
Security evaluation — stress-run security packs and posture scoring

Runs and Evals: Understand the difference between a run, a ranked result set, and the broader eval concept.
Agents and Deployments: See how runnable agent targets are modeled before they can participate in an eval.
Challenge Packs and Inputs: Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
Replay and Scorecards: Learn how canonical events become timelines, evidence, and comparison-ready outputs.
Tools, Network, and Secrets: See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
Artifacts: Understand stored files, pack assets, run evidence, and signed downloads.
Try CLI: Interactive disposable terminal demos for README badges; try CLIs before install.
Voice Artifact Contracts: Use generic audio, timing, sync, and media reports to evaluate voice agents across providers.

Challenge packs

YAML reference grounded in the backend parser, scoring, and enforcement paths, meant for pack authors publishing real workloads.

Reference overview: Map of every challenge-pack documentation page and where each topic is enforced in Go.
Bundle YAML reference: Top-level bundle keys, manifests, and constraints for prompt_eval versus native.
Evaluation spec: Validators, targets, metric collectors, scorecard dimensions, strategies, and post-execution captures.
LLM judges: Rubric, assertion, n_wise, and reference modes plus consensus keys and budgets.
Tools, primitives & policy: allowed_tool_kinds, built-in primitives, composed tools to http_request mocks and cycles.
Sandbox & E2B: Pack sandbox block, outbound network CIDR lists, sandbox provider env, and no-op modes.
Input sets & cases: Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
Multi-turn packs: Scripted, LLM, and human user-simulator phases with operator APIs and calibration reviews.
Eval workflows & gates: CLI eval start, baseline, scorecard, compare, gates, and regression scope flags, grounded in Cobra.

Guides

Task-oriented walkthroughs for authoring packs, setting up deployments, reading results, and using the docs with AI tools.

Write a Challenge Pack: Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
Configure Runtime Resources: Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
Interpret Results: Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
CI/CD Agent Gates: Define the agent revision, workload, baseline, and release gate a pull request should run.
Datasets overview: Pinned dataset evals, baselines, regression suite sync, and the dataset command lifecycle.
Synthetic dataset generation: Fast Self-Instruct and Agentic Self-Instruct weak-vs-strong generation in workspaces, plus DataSmith export.
Dataset CI Gates: Record dataset eval baselines, sync examples into regression suites, and gate CI with agentclash dataset test.
Security evaluation: Stress-run security packs, score secret leaks and adversarial acceptance, and measure posture.
CI/CD Workload Recipes: Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
Use with AI Tools: Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.

Agent Skills

Copyable AgentClash workflows that coding agents can install or fetch as markdown.

Skill Catalog: Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
Hub Skill: Start here for workflow map, dependency order, UI links, and pointers to every AgentClash skill.
Quickstart Skill: Run readiness checks and get the suggested next CLI command before evals.
Challenge Pack Skills: Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
Agent Build Skills: Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
CLI Setup Skill: Configure the CLI, authenticate, select workspaces, and run doctor checks.
Eval Runner Skill: Start, follow, and report AgentClash evals and runs with useful evidence.
Scorecard Reader Skill: Turn rankings, scorecards, and replay evidence into engineering findings.
Compare And Triage Skill: Manage baselines, compare runs, evaluate gates, and build replay triage envelopes.
Regression Flywheel Skill: Promote useful run failures into regression suites and verify suite-only runs.
CI Release Gate Skill: Compare candidates against baselines and wire AgentClash gates into CI.
Agent Harness Setup Skill: Create and run Agent Harness coding tasks, suites, executions, and failure review.
Multi Turn Operator Skill: Submit human operator messages when multi_turn run agents await input.
Dataset Workflows Skill: Manage datasets, eval gates, synthetic generation, traces, and regression sync.
Prompt Eval Playground Skill: Scaffold, validate, and run prompt eval configs and playground experiments.
Workspace Admin Skill: Administer organizations, workspaces, and membership beyond basic CLI login.
Security Evaluation Skill: Run client-side security stress harnesses against security challenge packs.

Reference

Reference surfaces generated from current source readers where possible.

CLI: Commands, flags, and command groups generated from the Cobra source tree.
Config: Current environment surface pulled from the API, worker, CLI, and example config sources.

Architecture

System boundaries, runtime components, and why the stack is shaped this way.

Overview: Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
Orchestration: How API requests become Temporal workflows and how the worker executes them.
Sandbox Layer: Why execution is isolated behind a provider boundary and how E2B fits today.
Data Model: The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
Evidence Loop: How run events, artifacts, and scorecards move from execution into replay and review.
Frontend: How the Next.js app is split between public product pages, authenticated app routes, and docs.

Contributing

Get the repo running and understand where to start making changes.

Setup: Clone the repo, boot the local stack, and choose the fastest dev loop for your task.
Codebase Tour: Map the top-level modules before you start changing APIs, workflows, or the web app.
Testing: Pick the smallest useful validation loop and use review checkpoints for scoped changes.

Docs FAQ

Where should I start with AgentClash?

Start with the quickstart, then write a challenge pack for one real agent workload and run it locally or against a hosted workspace.

Can AgentClash docs help with CI agent gates?

Yes. The docs cover challenge packs, scorecards, baseline comparisons, and CI/CD gates for catching AI agent regressions before release.
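To make the baseline-versus-candidate idea concrete, here is a minimal sketch of a regression gate. It does not reproduce how AgentClash evaluates gates: `GateResult`, `evaluate_gate`, the metric names, and the `tolerance` parameter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    regressions: List[str]


def evaluate_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> GateResult:
    """Fail when any baseline metric drops by more than `tolerance`.

    Scores are assumed to be "higher is better"; this is an illustrative
    convention, not a documented AgentClash scorecard format.
    """
    regressions: List[str] = []
    for metric, base_score in sorted(baseline.items()):
        cand_score = candidate.get(metric)
        if cand_score is None:
            # A metric that disappeared is treated as a regression.
            regressions.append(f"{metric}: missing from candidate")
        elif cand_score < base_score - tolerance:
            regressions.append(f"{metric}: {base_score:.2f} -> {cand_score:.2f}")
    return GateResult(passed=not regressions, regressions=regressions)


if __name__ == "__main__":
    # Hypothetical scorecard values.
    baseline = {"task_success": 0.90, "tool_accuracy": 0.80}
    candidate = {"task_success": 0.91, "tool_accuracy": 0.70}
    result = evaluate_gate(baseline, candidate)
    for line in result.regressions:
        print("regression:", line)
    # A CI job would typically turn the verdict into an exit code.
    raise SystemExit(0 if result.passed else 1)
```

The tolerance keeps small score noise from blocking a release while still catching a real drop on any single metric.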

Are the docs available for coding agents?

Yes. AgentClash publishes llms.txt, llms-full.txt, and per-page markdown exports so coding agents can read the docs directly.
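For a coding agent that wants to index these exports, the sketch below parses an llms.txt-style file into sections of links. It assumes the file follows the common llms.txt layout (`## Section` headings followed by `- [title](url): description` items), which this page does not confirm. The sample text, its example.com URLs, `DocLink`, and `parse_llms_txt` are placeholders, not AgentClash content.

```python
import re
from typing import Dict, List, NamedTuple


class DocLink(NamedTuple):
    title: str
    url: str
    description: str


# Matches "- [Title](url)" with an optional ": description" suffix.
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)

# Placeholder input; a real file would be downloaded from the docs site.
SAMPLE = """\
# Example Docs

> Placeholder summary line.

## Guides

- [Quickstart](https://example.com/quickstart.md): Shortest path to a first run
- [Self-host](https://example.com/self-host.md)
"""


def parse_llms_txt(text: str) -> Dict[str, List[DocLink]]:
    """Group link items under the most recent '## ' heading."""
    sections: Dict[str, List[DocLink]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
            continue
        match = LINK_RE.match(line)
        if match and current is not None:
            description = (match["desc"] or "").strip()
            sections[current].append(DocLink(match["title"], match["url"], description))
    return sections


if __name__ == "__main__":
    for section, links in parse_llms_txt(SAMPLE).items():
        print(section)
        for link in links:
            print("  ", link.title, link.url)
```

Links that appear before the first `## ` heading are skipped in this sketch; an agent that needs them could collect them under a default section instead.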