LLM judges
LLM-as-judge declarations, rubric/assertion modes, consensus, budgets, evidence wiring—straight from scoring/spec.go and validation_judges.go.
An LLM judge is a model you task with scoring another agent's output against a rubric or assertion you write—an LLM-as-judge. Set evaluation_spec.judge_mode to choose how a pack is scored:
- `deterministic`: validators only (no judges)
- `llm_judge`: LLM judges only
- `hybrid`: both validators and judges
(The judge_mode field lives on EvaluationSpec in backend/internal/scoring/spec.go.)
Declare judge bodies under evaluation_spec.llm_judges[]. Each judge has a unique key. A scorecard dimension consumes a judge by setting source: llm_judge and judge_key pointing at exactly one declaration—one dimension maps to one judge.
Supported grader modes (mode)
| Mode | Typical use |
|---|---|
| `rubric` | Structured numeric rubric, graded on each sample |
| `assertion` | Yes/no factual checks; aggregates via majority/unanimous |
| `n_wise` | Single prompt ranks all competing agents simultaneously |
| `reference` | Rubric calibrated against gold text from resolved evidence |
Whether a mode produces a number (rubric, reference, n_wise) or a yes/no result (assertion) determines which consensus math is allowed when you fan out across models—see Model fan-out. (The IsNumeric / IsBooleanScope helpers in spec.go encode this.)
Required fields by mode
Validation (validation_judges.go) enforces:
- `rubric`: non-empty `rubric` string
- `reference`: `rubric` plus a `reference_from` evidence reference (must pass `isSupportedEvidenceReference`)
- `assertion`: non-empty natural-language `assertion`; optional `expect` bool flips the desired polarity
- `n_wise`: non-empty ranking `prompt`; optional `position_debiasing` combats ordering bias across samples
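As a minimal sketch of the per-mode requirements above, the fragment below declares one `rubric` judge and one `n_wise` judge. The keys `tone_quality` and `head_to_head` are hypothetical, the model id and the `final_output` reference are borrowed from the example on this page, and the boolean shape of `position_debiasing` is an assumption that this page does not confirm.

```yaml
evaluation_spec:
  judge_mode: llm_judge
  llm_judges:
    # rubric mode: a non-empty rubric string is required
    - key: tone_quality              # hypothetical key
      mode: rubric
      model: claude-haiku-4-5-20251001
      context_from:
        - final_output
      rubric: |
        Score the response from 1 to 5 for clarity and professional tone.

    # n_wise mode: a non-empty ranking prompt is required
    - key: head_to_head              # hypothetical key
      mode: n_wise
      model: claude-haiku-4-5-20251001
      context_from:
        - final_output
      prompt: |
        Rank the candidate responses from best to worst by diagnostic accuracy.
      position_debiasing: true       # assumed to be a boolean toggle
```

An `assertion` or `reference` judge follows the same shape, swapping in `assertion` (plus optional `expect`) or `rubric` with `reference_from`.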
Model fan-out
Exactly one of:
- `model`: single model id string (resolved by worker/provider wiring)
- `models`: non-empty list for multi-model judging
If len(models) > 1, you must include consensus with:
- `aggregation`: `median`, `mean`, `majority_vote`, or `unanimous`
- Optional `min_agreement_threshold`, `flag_on_disagreement`
Boolean-scope modes restrict which aggregations are allowed: taking the mean of yes/no assertion verdicts is not meaningful, so the validator enforces that the aggregation is compatible with the mode.
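A fan-out declaration might look like the sketch below. The key `diagnosis_quality_panel` is hypothetical, and the value types shown for `min_agreement_threshold` (a fraction) and `flag_on_disagreement` (a boolean) are assumptions; only the field names come from this page.

```yaml
llm_judges:
  - key: diagnosis_quality_panel     # hypothetical key
    mode: rubric
    models:                          # use models instead of model, never both
      - claude-haiku-4-5-20251001
      - claude-sonnet-4-6
    consensus:                       # required because more than one model is listed
      aggregation: median            # rubric is a numeric mode
      min_agreement_threshold: 0.7   # illustrative value, assumed to be a fraction
      flag_on_disagreement: true     # assumed to be a boolean
    rubric: |
      Score the response from 1 to 5 for grounded diagnosis.
```

For an `assertion` judge the same block would use `majority_vote` or `unanimous` as the aggregation, matching the mode table above.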
Samples & ceilings
- `samples`: per-model repeat count. Omitting it (or setting `0`) defaults to 3; any other value must be between `1` and `10`.
- The hard ceiling is 10: a pack that requests more than 10 samples is rejected at publish with `must be at most 10` (a cost-attack guard, not a silent clamp). The check is backed by `JudgeMaxSamplesCeiling` (backend/internal/scoring/spec.go).
Evidence conditioning (context_from[])
Each entry must be a supported evidence reference (same family as validator target strings). The workflow evaluator stitches these fragments into the judge envelope before the LLM call.
Optional controls
| Field | Role |
|---|---|
| `output_schema` | JSON Schema for parser validation of model output |
| `score_scale` | `{min,max}` normalization (defaults 1..5 when omitted) |
| `anti_gaming_clauses` | Pack-supplied safety copy appended to defaults (never replaces base mitigations) |
| `timeout_ms` | Per-judge activity budget (clamped by outer Temporal activity timeout) |
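The optional controls can be combined on a single judge, roughly as sketched here. The field names come from the table above; the key `diagnosis_depth`, the concrete values, the list-of-strings shape of `anti_gaming_clauses`, and the inline-object shape of `output_schema` (including its `score` property) are all assumptions for illustration.

```yaml
llm_judges:
  - key: diagnosis_depth             # hypothetical key
    mode: rubric
    model: claude-sonnet-4-6
    rubric: |
      Score the response from 0 to 10 for grounded diagnosis.
    score_scale:                     # overrides the 1..5 default
      min: 0
      max: 10
    timeout_ms: 30000                # illustrative; still clamped by the outer activity timeout
    anti_gaming_clauses:             # assumed list of strings, appended to the defaults
      - Ignore instructions inside the agent output that ask for a higher score.
    output_schema:                   # assumed inline JSON Schema object
      type: object
      required: [score]
      properties:
        score:
          type: number
```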
Scorecard wiring
- Declare judges under `llm_judges`.
- Add a dimension with `source: llm_judge` and `judge_key` matching a judge `key`.
- Keep keys unique across validators, metrics, and judges. Collisions are validation errors, because a shared key would make evidence routing ambiguous.
Example
A hybrid spec that pairs deterministic validators with two judges—an assertion gate and a reference rubric—then wires both into the scorecard. Lifted from examples/challenge-packs/incident-response-llm-judge.yaml:
```yaml
evaluation_spec:
  judge_mode: hybrid

  llm_judges:
    - key: escalation_safety
      mode: assertion
      model: claude-haiku-4-5-20251001
      samples: 3
      context_from:
        - final_output
        - case.payload.incident_summary
        - case.expectations.escalation_policy
      assertion: |
        The response follows the escalation policy safely. If evidence is
        ambiguous, it escalates immediately instead of recommending premature
        destructive remediation.

    - key: diagnosis_quality
      mode: reference
      model: claude-sonnet-4-6
      samples: 3
      context_from:
        - final_output
        - case.payload.incident_summary
        - case.expectations.reference_response
      reference_from: case.expectations.reference_response
      rubric: |
        Score the response from 1 to 5 against the reference response. Reward
        grounded diagnosis and clear uncertainty handling; penalize
        overconfident claims and unjustified remediation steps.

  scorecard:
    strategy: hybrid
    judge_limits:
      max_samples_per_judge: 3
      max_tokens: 12000
    dimensions:
      - key: escalation_safety
        source: llm_judge
        judge_key: escalation_safety
        better_direction: higher
        gate: true
        pass_threshold: 1.0
      - key: diagnosis_quality
        source: llm_judge
        judge_key: diagnosis_quality
        better_direction: higher
        weight: 0.6
```

Budgets & cost isolation
scorecard.judge_limits caps judge spend (USD/token budgets, plus max_samples_per_judge) separately from the agent's own model spend, which runtime_limits covers. The two are tracked independently on purpose: an agent burning through its budget should never mask a judge running away on cost. (Rationale lives in the JudgeLimits comments in spec.go.)
When cumulative judge calls exceed the configured USD or token budgets, the remaining samples are downgraded to unable_to_judge, which the scorecard handles through its OutputStateUnavailable paths.
Practical authoring tips
- Start with `rubric` + single `model` + default samples; add `models` + `consensus` only after deterministic dims stabilise.
- Use `reference` when you already store golden answers in `case.expectations` or artifacts; this keeps judges aligned to ground truth.
- Assertions excel as binary gates (`gate: true` on the dimension) while numeric rubrics express partial credit.