AgentClash

Evaluation spec reference

The pack-local evaluation_spec is unmarshalled into scoring.EvaluationSpec (backend/internal/scoring/spec.go) via strict decoding. This page maps its fields to the enums and collectors that are actually implemented, not to aspirational placeholders.

For LLM graders see the dedicated guide: LLM judges.

Judge mode (`judge_mode`)

Top-level discriminator on how scoring composes deterministic pieces with judges:

| Value | Constant |
| --- | --- |
| deterministic | JudgeModeDeterministic |
| llm_judge | JudgeModeLLMJudge |
| hybrid | JudgeModeHybrid |

Invalid values fail validation early.
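As a rough illustration of that early check, the sketch below rejects any value outside the three documented modes. The constant names come from the table above; the `JudgeMode` type name and the `validateJudgeMode` helper are assumptions made for this example and are not taken from spec.go.

```go
package main

import "fmt"

// JudgeMode is a hypothetical type name; only the constant names and
// string values are taken from this page.
type JudgeMode string

const (
	JudgeModeDeterministic JudgeMode = "deterministic"
	JudgeModeLLMJudge      JudgeMode = "llm_judge"
	JudgeModeHybrid        JudgeMode = "hybrid"
)

// validateJudgeMode (hypothetical helper) accepts only the three
// documented modes and returns an error for anything else.
func validateJudgeMode(m JudgeMode) error {
	switch m {
	case JudgeModeDeterministic, JudgeModeLLMJudge, JudgeModeHybrid:
		return nil
	default:
		return fmt.Errorf("invalid judge_mode %q", string(m))
	}
}

func main() {
	fmt.Println(validateJudgeMode("hybrid"))
	fmt.Println(validateJudgeMode("llm"))
}
```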

Validators (`validators[]`)

Each entry is a ValidatorDeclaration:

key — unique within spec; also forbidden to collide with metrics or judges
type — enumerated ValidatorType (snippet below)
target — evidence reference (validators require supported references—see Evidence references section)
expected_from — often required depending on validator type (RequiresExpectedFrom in spec.go)
config — type-specific strict JSON validated in validation.go
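To make "strict decoding" concrete, here is a minimal sketch that decodes one validator entry and rejects unknown fields. It assumes the spec is JSON-encoded and that the Go field names look like the ones below; both are assumptions, and `decodeStrict` is a hypothetical helper rather than the project's decoder.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ValidatorDeclaration mirrors the five fields listed above. The Go field
// names and the JSON encoding are assumptions of this sketch.
type ValidatorDeclaration struct {
	Key          string          `json:"key"`
	Type         string          `json:"type"`
	Target       string          `json:"target"`
	ExpectedFrom string          `json:"expected_from,omitempty"`
	Config       json.RawMessage `json:"config,omitempty"`
}

// decodeStrict rejects unknown fields, so a typo such as "expect_from"
// surfaces as an error instead of being silently dropped.
func decodeStrict(data []byte) (ValidatorDeclaration, error) {
	var v ValidatorDeclaration
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&v)
	return v, err
}

func main() {
	good := []byte(`{"key":"answer","type":"exact_match","target":"final_output","expected_from":"case.expectations.answer"}`)
	v, err := decodeStrict(good)
	fmt.Println(v.Key, v.Type, err)

	typo := []byte(`{"key":"answer","type":"exact_match","target":"final_output","expect_from":"x"}`)
	_, err = decodeStrict(typo)
	fmt.Println("typo rejected:", err != nil)
}
```

Keeping `config` as raw JSON lets a second, type-specific strict pass validate it once the validator type is known.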

Implemented validator types

From ValidatorType* constants:

exact_match, contains, regex_match, json_schema, json_path_match, boolean_assert, fuzzy_match, numeric_match, normalized_match, token_f1, math_equivalence, bleu_score, rouge_score, chrf_score, file_content_match, file_exists, file_json_schema, directory_structure, code_execution

File-oriented validators operate on sandbox artifacts; IsFileValidator() distinguishes them from the rest (see File validators).

Always check RequiresExpectedFrom for the validator type you declare: for example, file_exists, directory_structure, and code_execution can rely on config or paths without expected_from.
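The expected_from rule can be pictured as a per-type lookup. In the sketch below, the three exempt types are the ones named on this page; treating every other type as requiring expected_from is an assumption made for illustration, and `requiresExpectedFrom` is a hypothetical stand-in for the real RequiresExpectedFrom in spec.go.

```go
package main

import "fmt"

// ValidatorType values used here are copied from the list on this page.
type ValidatorType string

// noExpectedFrom holds the three types this page names as able to run
// without expected_from. Assuming that all other types require it is a
// simplification of this sketch; spec.go is authoritative.
var noExpectedFrom = map[ValidatorType]bool{
	"file_exists":         true,
	"directory_structure": true,
	"code_execution":      true,
}

// requiresExpectedFrom is a hypothetical stand-in for the real method.
func requiresExpectedFrom(t ValidatorType) bool {
	return !noExpectedFrom[t]
}

// checkExpectedFrom reports an error when a declaration omits an
// expected_from that its type needs.
func checkExpectedFrom(t ValidatorType, expectedFrom string) error {
	if requiresExpectedFrom(t) && expectedFrom == "" {
		return fmt.Errorf("validator type %q requires expected_from", string(t))
	}
	return nil
}

func main() {
	fmt.Println(checkExpectedFrom("file_exists", ""))
	fmt.Println(checkExpectedFrom("exact_match", ""))
}
```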

Metrics (`metrics[]`)

MetricDeclaration requires:

| Field | Notes |
| --- | --- |
| key | Unique within spec |
| type | numeric, text, or boolean |
| collector | Implemented switch in engine_metrics.go |
| unit | Stored for dashboards/score normalization |

Collectors wired today (verbatim keys):

run_total_latency_ms, run_ttft_ms, run_input_tokens, run_output_tokens, run_total_tokens, run_agent_tokens, run_race_context_tokens, run_model_cost_usd, run_completed_successfully, run_failure_count, behavioral scores (behavioral_recovery_score, … ), validator_pass_rate

Declaring a collector that does not exist fails silently only if the evidence is missing, so prefer copying keys from the tests in backend/internal/scoring/engine_metrics.go.
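A pack author could guard against mistyped collector keys with a small pre-flight lint like the sketch below. The collector keys are the ones spelled out on this page (the behavioral score family is elided in the source, so only the one named key appears). The `MetricDeclaration` field names and the `lintMetrics` helper are hypothetical and do not reflect the engine's own validation.

```go
package main

import "fmt"

// knownCollectors holds the collector keys spelled out on this page.
var knownCollectors = map[string]bool{
	"run_total_latency_ms":       true,
	"run_ttft_ms":                true,
	"run_input_tokens":           true,
	"run_output_tokens":          true,
	"run_total_tokens":           true,
	"run_agent_tokens":           true,
	"run_race_context_tokens":    true,
	"run_model_cost_usd":         true,
	"run_completed_successfully": true,
	"run_failure_count":          true,
	"behavioral_recovery_score":  true,
	"validator_pass_rate":        true,
}

// MetricDeclaration uses assumed Go field names for the four documented fields.
type MetricDeclaration struct {
	Key, Type, Collector, Unit string
}

// lintMetrics (hypothetical helper) returns one message per problem found:
// duplicate keys, unknown types, and collectors missing from the list above.
func lintMetrics(metrics []MetricDeclaration) []string {
	var problems []string
	seen := map[string]bool{}
	for _, m := range metrics {
		if seen[m.Key] {
			problems = append(problems, fmt.Sprintf("duplicate metric key %q", m.Key))
		}
		seen[m.Key] = true
		switch m.Type {
		case "numeric", "text", "boolean":
		default:
			problems = append(problems, fmt.Sprintf("metric %q: unknown type %q", m.Key, m.Type))
		}
		if !knownCollectors[m.Collector] {
			problems = append(problems, fmt.Sprintf("metric %q: unknown collector %q", m.Key, m.Collector))
		}
	}
	return problems
}

func main() {
	problems := lintMetrics([]MetricDeclaration{
		{Key: "latency", Type: "numeric", Collector: "run_total_latency_ms", Unit: "ms"},
		{Key: "latency", Type: "numeric", Collector: "run_latency", Unit: "ms"},
	})
	for _, p := range problems {
		fmt.Println(p)
	}
}
```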

Behavioral panel (`behavioral`)

Optional behavioral.signals[] referencing behavioral.signal enums:

recovery_behavior
exploration_efficiency
error_cascade
scope_adherence
confidence_calibration

Each signal supports a weight, an optional gate, and a pass_threshold for hardened evaluation sessions.

Post-execution sandbox captures (`post_execution_checks`)

Declare file/directory grabs before sandbox teardown (post_execution.go):

| type | Meaning |
| --- | --- |
| file_capture | Persist file bytes up to configured max |
| directory_listing | Snapshot structure |

Captured evidence is exposed to graders through file:<path> style references downstream—pair with validators that target those artifacts.

Defaults: ~1 MiB per file, aggregate caps enforced per run (DefaultMaxFileSizeBytes, DefaultMaxTotalCaptureBytes).
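The interaction between the per-file and aggregate caps can be sketched as follows. The 1 MiB per-file figure comes from the text above. The 4 MiB aggregate value is a placeholder, since the real DefaultMaxTotalCaptureBytes is not stated here, and truncating oversized files rather than skipping them is likewise an assumption of this sketch.

```go
package main

import "fmt"

const (
	// Roughly 1 MiB per file, as stated above.
	maxFileSizeBytes = 1 << 20
	// Hypothetical aggregate cap; the real DefaultMaxTotalCaptureBytes
	// value is not given on this page.
	maxTotalCaptureBytes = 4 << 20
)

// captureFile returns the bytes to persist for one file and the updated
// running total for the run. Truncation (rather than skipping the file)
// is an assumption of this sketch.
func captureFile(content []byte, usedSoFar int) ([]byte, int) {
	limit := maxFileSizeBytes
	if remaining := maxTotalCaptureBytes - usedSoFar; remaining < limit {
		limit = remaining
	}
	if limit < 0 {
		limit = 0
	}
	if len(content) > limit {
		content = content[:limit]
	}
	return content, usedSoFar + len(content)
}

func main() {
	big := make([]byte, 3<<20) // a 3 MiB file, larger than the per-file cap
	used := 0
	for i := 0; i < 5; i++ {
		var kept []byte
		kept, used = captureFile(big, used)
		fmt.Println("file", i, "kept", len(kept), "total", used)
	}
}
```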

Scorecard (`scorecard`)

ScorecardDeclaration holds:

Dimensions (`dimensions`)

Each dimension may be a plain string shorthand (historical compatibility) or an expanded object specifying:

key — dimension name (correctness, latency, cost, behavioral, custom)
source — dispatcher: validators, metric, reliability, latency, cost, behavioral, llm_judge
validators[], metric, judge_key — depending on source
weight, normalization — linear normalize against target/max envelopes
gate, pass_threshold — hard fail semantics (see Strategies)

Built-in shortcut keys are normalized during normalizeEvaluationSpec (validation.go): correctness, reliability, latency, cost, and behavioral auto-fill sensible sources.
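Accepting either a bare string or an expanded object is commonly done in Go with a custom `UnmarshalJSON`. The sketch below shows that pattern under the assumption that the spec is JSON. The trimmed `Dimension` struct and the `parseDimensions` helper are hypothetical, and the sketch does not attempt the source auto-fill that normalizeEvaluationSpec performs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dimension is a hypothetical, trimmed-down shape. Field names follow the
// list above; the real declaration carries more fields.
type Dimension struct {
	Key    string  `json:"key"`
	Source string  `json:"source,omitempty"`
	Weight float64 `json:"weight,omitempty"`
	Gate   bool    `json:"gate,omitempty"`
}

// UnmarshalJSON accepts either a bare string ("latency") or an expanded object.
func (d *Dimension) UnmarshalJSON(data []byte) error {
	var short string
	if err := json.Unmarshal(data, &short); err == nil {
		*d = Dimension{Key: short}
		return nil
	}
	// plain has the same fields but no methods, which avoids infinite recursion.
	type plain Dimension
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	*d = Dimension(p)
	return nil
}

// parseDimensions (hypothetical helper) decodes a mixed list of shorthands and objects.
func parseDimensions(raw string) ([]Dimension, error) {
	var dims []Dimension
	err := json.Unmarshal([]byte(raw), &dims)
	return dims, err
}

func main() {
	dims, err := parseDimensions(`["correctness", {"key":"speed","source":"metric","weight":2}]`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", dims)
}
```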

Strategy (`strategy`)

| Strategy | Semantics sketch |
| --- | --- |
| weighted | Weighted mean; gated dims may still veto pass verdict |
| binary | All dimensions treated as gates; scorecard-level pass_threshold is rejected (prevents ambiguity) |
| hybrid | Gates AND aggregate over non-gate dims must clear optional scorecard.pass_threshold |

See doc comments on ScoringStrategy in spec.go for nuanced behavior—especially hybrid vs weighted gate interplay.
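One way to read the strategy table is the sketch below, which computes a score and a pass verdict for each strategy. It is an interpretation, not the engine's code: `DimResult` and `verdict` are hypothetical names, and applying the optional threshold to the weighted strategy and reporting a weighted mean as the score under binary are both assumptions.

```go
package main

import "fmt"

// DimResult is a hypothetical per-dimension outcome used only by this sketch.
type DimResult struct {
	Score  float64 // normalized to [0, 1]
	Weight float64
	Gate   bool
	Passed bool // outcome of the dimension's own pass_threshold check
}

// weightedMean averages the scores of the dimensions selected by keep.
func weightedMean(dims []DimResult, keep func(DimResult) bool) float64 {
	var sum, total float64
	for _, d := range dims {
		if keep(d) {
			sum += d.Score * d.Weight
			total += d.Weight
		}
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

// verdict returns the aggregate score and pass flag. A nil threshold means
// no scorecard.pass_threshold was set; the comparison is inclusive.
func verdict(strategy string, dims []DimResult, threshold *float64) (float64, bool, error) {
	gatesOK := true
	for _, d := range dims {
		// Under binary, every dimension behaves as a gate.
		if (d.Gate || strategy == "binary") && !d.Passed {
			gatesOK = false
		}
	}
	all := func(DimResult) bool { return true }
	nonGate := func(d DimResult) bool { return !d.Gate }
	switch strategy {
	case "binary":
		if threshold != nil {
			return 0, false, fmt.Errorf("scorecard.pass_threshold is not allowed with binary")
		}
		return weightedMean(dims, all), gatesOK, nil
	case "weighted":
		score := weightedMean(dims, all)
		return score, gatesOK && (threshold == nil || score >= *threshold), nil
	case "hybrid":
		score := weightedMean(dims, nonGate)
		return score, gatesOK && (threshold == nil || score >= *threshold), nil
	}
	return 0, false, fmt.Errorf("unknown strategy %q", strategy)
}

func main() {
	dims := []DimResult{
		{Score: 1.0, Weight: 1, Gate: true, Passed: true},
		{Score: 0.5, Weight: 3, Passed: true},
	}
	th := 0.6
	for _, s := range []string{"weighted", "hybrid", "binary"} {
		score, pass, err := verdict(s, dims, &th)
		fmt.Println(s, score, pass, err)
	}
}
```

With the same inputs, the weighted and hybrid strategies can disagree because hybrid, in this reading, leaves gated dimensions out of the aggregate.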

`scorecard.pass_threshold`

Optional inclusive overall score cutoff (documented extensively in struct comment). Forbidden for pure binary.

Judge budgets (`scorecard.judge_limits`)

Caps LLM-as-judge spend per run (MaxSamplesPerJudge, MaxCallsUSD, MaxTokens). Hard-coded ceilings (JudgeMaxSamplesCeiling) still clamp pack-authored overrides.

Legacy normalization (`normalization`)

latency.target_ms and cost.target_usd are migrated into dimension-level normalization automatically for older specs; the legacy form is still accepted.

Runtime limits (`runtime_limits`)

max_total_tokens, max_cost_usd, max_duration_ms—enforced upstream of sandbox/model loops; surfaced for UI + scoring fallbacks.

Pricing (`pricing.models[]`)

Pricing rows describe per-million-token economics used to normalize run_model_cost_usd. Match each row to the ProviderKey/ProviderModelID tuples your workspace deployments actually use: misaligned rows produce weak cost dims but do not invalidate the pack.
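The arithmetic behind a per-million-token row is simple enough to sketch. Below, the provider fields follow the ProviderKey/ProviderModelID names used above, while the price field names, the `costUSD` helper, and the provider and model strings are hypothetical placeholders.

```go
package main

import "fmt"

// PricingRow is a hypothetical shape: the provider fields follow the names
// above, the two price field names are assumed for this sketch.
type PricingRow struct {
	ProviderKey         string
	ProviderModelID     string
	InputUSDPerMillion  float64
	OutputUSDPerMillion float64
}

// costUSD finds the row for a deployment and prices its token usage.
// ok is false when no row matches, which is the misaligned-row case.
func costUSD(rows []PricingRow, provider, model string, inTok, outTok int64) (usd float64, ok bool) {
	for _, r := range rows {
		if r.ProviderKey == provider && r.ProviderModelID == model {
			in := float64(inTok) / 1e6 * r.InputUSDPerMillion
			out := float64(outTok) / 1e6 * r.OutputUSDPerMillion
			return in + out, true
		}
	}
	return 0, false
}

func main() {
	rows := []PricingRow{
		{ProviderKey: "example-provider", ProviderModelID: "example-model", InputUSDPerMillion: 2, OutputUSDPerMillion: 8},
	}
	usd, ok := costUSD(rows, "example-provider", "example-model", 500000, 250000)
	fmt.Println(usd, ok)

	_, ok = costUSD(rows, "example-provider", "other-model", 500000, 250000)
	fmt.Println("matched:", ok)
}
```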

Evidence references validators understand

Validated by isSupportedEvidenceReference:

Absolute shortcuts: final_output, run.final_output, challenge_input, case.payload
Dotted accessors: case.payload.*, case.inputs.*, case.expectations.*, artifact.*
Sandbox artifacts: prefix file: with non-empty remainder
Literals: literal:…

Prefer explicit paths whenever you refactor the input schema: ambiguous references fail validation instead of drifting silently.
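The four reference families listed above can be approximated with a short predicate. This is a sketch of the documented rules, not the body of isSupportedEvidenceReference: `supportedEvidenceRef` is a hypothetical name, requiring a non-empty remainder after the dotted prefixes is an assumption (the page states it only for file:), and accepting an empty literal is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedEvidenceRef approximates the rules listed above.
func supportedEvidenceRef(ref string) bool {
	// Absolute shortcuts.
	switch ref {
	case "final_output", "run.final_output", "challenge_input", "case.payload":
		return true
	}
	// Dotted accessors and sandbox artifacts need something after the prefix.
	prefixes := []string{"case.payload.", "case.inputs.", "case.expectations.", "artifact.", "file:"}
	for _, p := range prefixes {
		if strings.HasPrefix(ref, p) && len(ref) > len(p) {
			return true
		}
	}
	// Literals; this sketch does not require a non-empty value.
	return strings.HasPrefix(ref, "literal:")
}

func main() {
	for _, ref := range []string{"final_output", "case.inputs.question", "file:out/report.json", "file:", "output"} {
		fmt.Printf("%-24s %v\n", ref, supportedEvidenceRef(ref))
	}
}
```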
