Challenge packs
LLM judges
LLM-as-judge declarations, rubric/assertion modes, consensus, budgets, evidence wiring—straight from scoring/spec.go and validation_judges.go.
Agents can be scored by deterministic validators alone, pure LLM judges, or hybrid combining both—the tri-state lives in judge_mode on EvaluationSpec (backend/internal/scoring/spec.go).
Judge bodies live in evaluation_spec.llm_judges[] (LLMJudgeDeclaration). Dimensions that consume judges set source: llm_judge and judge_key referencing exactly one declaration (1:1 mapping by design).
Supported grader modes (mode)
| Mode | Typical use |
| --- | --- |
| rubric | Structured numeric rubric graded each sample |
| assertion | Yes/no factual checks; aggregates via majority/unanimous |
| n_wise | Single prompt ranks all competing agents simultaneously |
| reference | Rubric calibrated against gold text from resolved evidence |
IsNumeric / IsBooleanScope helpers govern which consensus math applies.
Required fields by mode
Validation (validation_judges.go) enforces:
rubric— non-empty rubric stringreference— rubric +reference_fromevidence reference (must passisSupportedEvidenceReference)assertion— non-empty natural-language assertion; optionalexpectbool flips desired polarityn_wise— non-empty rankingprompt; optionalposition_debiasingcombats ordering bias across samples
Model fan-out
Exactly one of:
model— single model id string (resolved by worker/provider wiring)models— non-empty list for multi-model judging
If len(models) > 1, you must include consensus with:
aggregation—median,mean,majority_vote, orunanimous- Optional
min_agreement_threshold,flag_on_disagreement
Boolean-scope modes restrict some aggregations (assertions cannot be mean-averaged nonsensically—validator enforces compatibility).
Samples & ceilings
samples— per-model repeat count;0normalizes toJudgeDefaultSamples(3)- Hard cap
JudgeMaxSamplesCeiling(10) applied even if the pack requests more—cost attack guard
Evidence conditioning (context_from[])
Each entry must be a supported evidence reference (same family as validator target strings). The workflow evaluator stitches these fragments into the judge envelope before the LLM call.
Optional controls
| Field | Role |
| --- | --- |
| output_schema | JSON Schema for parser validation of model output |
| score_scale | {min,max} normalization (defaults 1..5 when omitted) |
| anti_gaming_clauses | Pack-supplied safety copy appended to defaults (never replaces base mitigations) |
| timeout_ms | Per-judge activity budget (clamped by outer Temporal activity timeout) |
Scorecard wiring
- Declare judges under
llm_judges. - Add a dimension with
source: llm_judgeandjudge_keymatching a judgekey. - Keep keys unique across validators, metrics, and judges—collisions are validation errors (namespace collision prevents ambiguous evidence routing).
Budgets & cost isolation
scorecard.judge_limits tracks judge spend separately from agent model spend covered by runtime_limits. This split is intentional (see Q7 discussion embedded in JudgeLimits comments in spec.go): agent overages should not hide judge runaway.
When cumulative judge calls exceed configured USD/token budgets, remaining samples downgrade to unable_to_judge states feeding scorecard OutputStateUnavailable paths.
Practical authoring tips
- Start with
rubric+ singlemodel+ default samples; addmodels+consensusonly after deterministic dims stabilise. - Use
referencewhen you already store golden answers incase.expectationsor artifacts—keeps judges aligned to ground truth. - Assertions excel as binary gates (
gate: trueon the dimension) while numeric rubrics express partial credit.