Public sitemap

AgentClash sitemap

Browse public AgentClash pages, AI agent evaluation resources, docs, and engineering posts.

Core pages

Home: AgentClash overview and product entry point.
Agent evaluation: Evaluate AI agents on real tasks with replay and scorecards.
Agent regression testing: Catch AI agent regressions with baseline comparisons and CI gates.
Docs: Guides and references for running AgentClash.
Blog: Engineering notes on AI agent evaluation and release gates.
Why AgentClash: Why real-task AI agent evaluation matters.
Team: The engineers building AgentClash.
LLMs index: Machine-readable index for AI assistants and coding agents.
Full LLMs bundle: Complete machine-readable AgentClash docs bundle.

Blog posts

AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks: A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.
Why We Built AgentClash: Static benchmarks leak. Leaderboards reward hype. We built something different.

Getting Started

Quickstart: Use the hosted backend and validate auth, workspace access, and run creation.
Self-Host: Bring up the local stack with Postgres, Temporal, API server, worker, and web app.
First Eval: Walk through the current happy path from seeded data to live run events and ranking output.

Concepts

Runs and Evals: Understand the difference between a run, a ranked result set, and the broader eval concept.
Agents and Deployments: See how runnable agent targets are modeled before they can participate in an eval.
Challenge Packs and Inputs: Understand how tasks, input sets, and scoring context are grouped into repeatable workloads.
Replay and Scorecards: Learn how canonical events become timelines, evidence, and comparison-ready outputs.
Tools, Network, and Secrets: See how pack-defined tools delegate to primitives, how outbound internet is controlled, and where secrets resolve.
Artifacts: Understand stored files, pack assets, run evidence, and signed downloads.

Challenge packs

Reference overview: Map of every challenge-pack documentation page and where each topic is enforced in Go.
Bundle YAML reference: Top-level bundle keys, manifests, constraints for prompt_eval versus native.
Evaluation spec: Validators, targets, metric collectors, scorecard dimensions, strategies, post-execution captures.
LLM judges: Rubric, assertion, n_wise, and reference modes plus consensus keys and budgets.
Tools, primitives & policy: allowed_tool_kinds, built-in primitives, composed tools to http_request mocks and cycles.
Sandbox & E2B: Pack sandbox block, outbound network CIDR lists, sandbox provider env, no-op modes.
Input sets & cases: Case inputs, expectations, artifacts, legacy payloads, and how payloads are persisted.
Eval workflows & gates: CLI eval start baseline scorecard compare gates and regression scope flags grounded in Cobra.

Guides

Write a Challenge Pack: Author a bundle YAML file, validate it, publish it, and understand the IDs AgentClash returns.
Configure Runtime Resources: Create secrets, provider accounts, model aliases, runtime profiles, and deployments in the order the product expects.
Interpret Results: Read timelines, scorecards, and ranking changes without getting lost in raw event volume.
CI/CD Agent Gates: Define the agent revision, workload, baseline, and release gate a pull request should run.
CI/CD Workload Recipes: Pick realistic agent CI workloads for coding, research, support, ops, and long-horizon agents.
Use with AI Tools: Use llms.txt, the full bundle, and per-page markdown exports with assistants and coding agents.

Agent Skills

Skill Catalog: Choose the right AgentClash skill for setup, authoring, running, reviewing, regression, or CI.
Challenge Pack Skills: Focused skills for planning, YAML authoring, input sets, scoring, judges, tools, artifacts, and publication.
Agent Build Skills: Skills for agent build specs, deployments, runtime resources, providers, secrets, and model aliases.
CLI Setup Skill: Configure the CLI, authenticate, select workspaces, and run doctor checks.
Eval Runner Skill: Start, follow, and report AgentClash evals and runs with useful evidence.
Scorecard Reader Skill: Turn rankings, scorecards, and replay evidence into engineering findings.
Regression Flywheel Skill: Promote useful run failures into regression suites and verify suite-only runs.
CI Release Gate Skill: Compare candidates against baselines and wire AgentClash gates into CI.

Reference

CLI: Commands, flags, and command groups generated from the Cobra source tree.
Config: Current environment surface pulled from the API, worker, CLI, and example config sources.

Architecture

Overview: Web, API, worker, Postgres, Temporal, sandbox, and artifact storage in one picture.
Orchestration: How API requests become Temporal workflows and how the worker executes them.
Sandbox Layer: Why execution is isolated behind a provider boundary and how E2B fits today.
Data Model: The core entities behind workspaces, deployments, challenge packs, runs, and evidence.
Evidence Loop: How run events, artifacts, and scorecards move from execution into replay and review.
Frontend: How the Next.js app is split between public product pages, authenticated app routes, and docs.

Contributing

Setup: Clone the repo, boot the local stack, and choose the fastest dev loop for your task.
Codebase Tour: Map the top-level modules before you start changing APIs, workflows, or the web app.
Testing: Pick the smallest useful validation loop and use review checkpoints for scoped changes.