Challenge Pack LLM Judges Skill
Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges/SKILL.md
Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-llm-judges
Full SKILL.md
---
name: agentclash-challenge-pack-llm-judges
description: Use when configuring AgentClash LLM-as-judge scoring, judge prompts, rubrics, assertion/reference/n-wise modes, evidence inputs, scorecard dimensions, abstention behavior, and judge result interpretation.
metadata:
  agentclash.role: challenge-pack-judging
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack LLM Judges

## Purpose
Add LLM-as-judge scoring only where deterministic validators cannot capture the whole evaluation. Judges should complement objective checks, not replace them.

Use this skill after deterministic validators and evidence sources are known. A good judge is narrow, evidence-bound, budget-aware, and wired to one scorecard dimension through `source: llm_judge`.

## Use When
- Quality depends on reasoning, helpfulness, style, relevance, faithfulness, or nuanced task completion.
- A deterministic validator would be brittle or incomplete.
- A pack needs rubric, assertion, reference, or n-wise cross-agent ranking.
- The scorecard needs judge rationale, confidence, variance, sample count, and model count in replay/scorecard evidence.

## Do Not Use When
- The behavior can be scored with deterministic validators.
- Evidence sources are not stable yet; use input-sets, artifacts, and scoring validators first.
- The run cannot afford extra model calls.
- The judge would need secrets in prompt text or private data that should not leave the workspace.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` calls the hosted API and requires auth plus a workspace. Use `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml` before validating.

## Validation Commands
Validate after changing `judge_mode`, `llm_judges`, judge evidence, scorecard judge dimensions, consensus, or judge limits.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

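The `--json` output can be consumed by a script. The following is a minimal sketch, assuming only the two documented top-level fields `valid` and `errors`; the shape of each `errors` entry is not specified here, so the helper (a hypothetical name) stringifies whatever it finds, and the sample payload is made up for illustration.

```python
import json


def summarize_validation(raw_json: str) -> tuple[bool, list[str]]:
    """Return (valid, error messages) from `validate --json` output."""
    report = json.loads(raw_json)
    valid = bool(report.get("valid", False))
    # Assumption: entries may be strings or objects; keep strings, dump the rest.
    errors = [
        e if isinstance(e, str) else json.dumps(e, sort_keys=True)
        for e in report.get("errors") or []
    ]
    return valid, errors


# Hypothetical payload, not captured from a real run.
sample = '{"valid": false, "errors": ["llm_judges[0].rubric is required"]}'
valid, errors = summarize_validation(sample)
print("valid" if valid else "invalid", errors)
```
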
## Minimal Hybrid Shape
Current evaluation specs still require at least one deterministic validator. Use `judge_mode: hybrid` when judges are paired with validators.

```yaml
version:
  evaluation_spec:
    name: support-quality
    version_number: 1
    judge_mode: hybrid
    validators:
      - key: mentions_policy
        type: contains
        target: final_output
        expected_from: literal:refund policy
    llm_judges:
      - mode: rubric
        key: helpfulness
        model: gpt-4o
        samples: 3
        context_from:
          - challenge_input
          - final_output
        rubric: |
          Score 1-5 for whether the answer is correct, complete, and easy for a support agent to use.
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          validators:
            - mentions_policy
          weight: 0.5
        - key: helpfulness
          source: llm_judge
          judge_key: helpfulness
          weight: 0.5
```

Mode coherence rules:

- `judge_mode: deterministic`: `llm_judges` must be empty.
- `judge_mode: llm_judge`: `llm_judges` must contain at least one judge.
- `judge_mode: hybrid`: `llm_judges` must contain at least one judge, and validators are still required by the current scoring validator.

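The three mode coherence rules can be pre-checked locally before calling the hosted validator. This is a rough sketch of those rules only, not the validator's actual implementation; the function name and error strings are hypothetical.

```python
def check_mode_coherence(judge_mode: str, llm_judges: list, validators: list) -> list[str]:
    """Apply the three documented judge_mode rules and return any violations."""
    errors = []
    if judge_mode == "deterministic" and llm_judges:
        errors.append("deterministic mode must not define llm_judges")
    if judge_mode in ("llm_judge", "hybrid") and not llm_judges:
        errors.append(f"{judge_mode} mode needs at least one judge")
    if judge_mode == "hybrid" and not validators:
        errors.append("hybrid mode still needs at least one validator")
    return errors


print(check_mode_coherence("hybrid", [{"key": "helpfulness"}], [{"key": "mentions_policy"}]))
```
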
## Judge Fields
Every `llm_judges[]` entry uses this source-backed shape:

```yaml
llm_judges:
  - mode: rubric
    key: stable_judge_key
    model: gpt-4o
    context_from:
      - final_output
    rubric: Rate the output 1-5.
```

Fields:

- `mode`: required; one of `rubric`, `assertion`, `reference`, or `n_wise`.
- `key`: required, unique, and must not collide with validator or metric keys.
- `model`: one model identifier.
- `models`: list of model identifiers. Set exactly one of `model` or `models`.
- `samples`: optional integer from `0` to `10`; `0` means default samples, currently `3`.
- `context_from`: optional list of supported evidence references.
- `output_schema`: optional JSON Schema draft-07 or 2020-12. Validation checks the schema; current judge message builders still instruct the built-in JSON response shape.
- `score_scale`: optional `{min, max}` for `rubric` and `reference`; `min` must be strictly less than `max`, default is `1..5`.
- `rubric`: required for `rubric` and `reference`.
- `assertion`: required for `assertion`.
- `expect`: optional boolean for `assertion`; when false, it flips the pass polarity.
- `prompt`: required for `n_wise`.
- `position_debiasing`: optional boolean for `n_wise`; rotates candidate order across samples.
- `reference_from`: required for `reference`; must be a supported evidence reference.
- `consensus`: required when `models` has more than one entry; invalid otherwise.
- `anti_gaming_clauses`: optional extra prompt clauses appended after the built-in judge instructions.
- `timeout_ms`: optional integer greater than `0`; default judge timeout is 60 seconds.

## Supported Evidence
Judge `context_from` and `reference_from` entries must use supported evidence references:

- `final_output`
- `run.final_output`
- `challenge_input`
- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`
- `artifact.<artifact_key>[.<field>]`
- `file:<post_execution_check_key>`
- `literal:<value>`

Each `context_from` entry is injected into the judge prompt as `<reference>:\n<resolved value>`. Reference judges also inject `reference_answer` from `reference_from`.

If any required judge context or reference evidence is unavailable, the judge result becomes unavailable with a `reason`; it does not produce a `normalized_score`.

## Rubric Mode
Use `mode: rubric` for subjective numeric scoring against a written rubric.

```yaml
llm_judges:
  - mode: rubric
    key: persuasiveness
    model: claude-sonnet-4-6
    samples: 3
    context_from:
      - challenge_input
      - final_output
    score_scale:
      min: 1
      max: 5
    rubric: |
      Score 1-5.
      1: Incorrect, incomplete, or unsupported.
      3: Mostly correct but misses important nuance.
      5: Correct, concise, evidence-bound, and directly useful.
scorecard:
  dimensions:
    - key: persuasiveness
      source: llm_judge
      judge_key: persuasiveness
```

The judge is instructed to return JSON shaped like:

```json
{"score": 4, "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

Rubric and reference scores are clamped to the configured `score_scale` and normalized to `0..1` for the scorecard dimension.

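To make the clamp-then-normalize step concrete, here is a small sketch. It assumes a linear min-max mapping from `score_scale` onto `0..1`; the exact formula is not stated above, and `normalize_score` is a hypothetical helper name.

```python
def normalize_score(score: float, scale_min: float = 1, scale_max: float = 5) -> float:
    """Clamp a raw judge score to the scale, then map it linearly onto 0..1."""
    if scale_min >= scale_max:
        raise ValueError("score_scale.min must be strictly less than score_scale.max")
    clamped = max(scale_min, min(scale_max, score))
    # Assumption: linear min-max normalization.
    return (clamped - scale_min) / (scale_max - scale_min)


print(normalize_score(4))  # a 4 on the default 1..5 scale
print(normalize_score(9))  # out-of-range scores are clamped to the scale first
```

Under this assumption, a score of `4` on the default `1..5` scale maps to `0.75`.
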
## Assertion Mode
Use `mode: assertion` for yes/no claims such as safety, groundedness, or policy compliance.

```yaml
llm_judges:
  - mode: assertion
    key: no_hallucination
    model: claude-haiku-4-5-20251001
    context_from:
      - case.payload.source_excerpt
      - final_output
    assertion: The response contains only information supported by the source excerpt.
    expect: true
scorecard:
  strategy: hybrid
  dimensions:
    - key: no_hallucination
      source: llm_judge
      judge_key: no_hallucination
      gate: true
      pass_threshold: 1.0
```

The judge is instructed to return JSON shaped like:

```json
{"pass": true, "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

The parser also accepts `verdict` values such as `pass`, `true`, `yes`, `fail`, `false`, or `no`. Assertion samples aggregate by majority and normalize to `1` or `0`.

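The majority aggregation and the `expect` flip can be pictured with the sketch below. Two details are assumptions rather than documented behavior: `expect` is applied per sample before counting, and a tie counts as a fail. The function name is hypothetical.

```python
def assertion_score(verdicts: list[bool], expect: bool = True) -> int:
    """Aggregate per-sample verdicts by majority and normalize to 1 or 0."""
    if not verdicts:
        raise ValueError("no parsable samples to aggregate")
    # A sample passes when the judge's verdict matches the expected polarity.
    passes = sum(1 for verdict in verdicts if verdict == expect)
    # Assumption: a strict majority is required, so ties score 0.
    return 1 if passes * 2 > len(verdicts) else 0


print(assertion_score([True, True, False]))
print(assertion_score([False, False, True], expect=False))
```
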
## Reference Mode
Use `mode: reference` when a gold answer or expected artifact exists.

```yaml
llm_judges:
  - mode: reference
    key: summary_quality
    model: gpt-4o
    context_from:
      - final_output
    reference_from: case.expectations.reference_summary
    rubric: |
      Compare the response to the reference summary for coverage, faithfulness, and concision.
      Penalize unsupported additions.
scorecard:
  dimensions:
    - key: summary_quality
      source: llm_judge
      judge_key: summary_quality
```

`reference_from` must resolve to available evidence. If it does not, the judge is unavailable.

## N-Wise Mode
Use `mode: n_wise` to rank all run agents in the same run against one another.

```yaml
llm_judges:
  - mode: n_wise
    key: overall_quality
    model: claude-sonnet-4-6
    samples: 3
    position_debiasing: true
    context_from:
      - final_output
    prompt: Rank the candidate outputs from best to worst on correctness, completeness, and clarity.
scorecard:
  dimensions:
    - key: overall
      source: llm_judge
      judge_key: overall_quality
```

The judge is instructed to return JSON shaped like:

```json
{"ranking": ["<run_agent_id>", "..."], "confidence": "low|medium|high", "reasoning": "brief rationale"}
```

The parser also accepts `ranked_ids`. Every candidate must appear exactly once. `n_wise` requires at least two run agents in the run. The current run agent receives a normalized Borda-style score from its rank.

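One plausible reading of "normalized Borda-style score" is sketched below: the top rank maps to `1.0`, the bottom rank to `0.0`, and ranks in between are spaced evenly. That formula is an assumption, as is the helper name `borda_score`; the ranking checks mirror the rules stated above.

```python
def borda_score(ranking: list[str], candidates: set[str], run_agent_id: str) -> float:
    """Score one run agent from a best-to-worst ranking of all candidates."""
    if len(candidates) < 2:
        raise ValueError("n_wise needs at least two run agents")
    if len(ranking) != len(candidates) or set(ranking) != candidates:
        raise ValueError("every candidate must appear exactly once")
    position = ranking.index(run_agent_id)  # 0 means ranked best
    # Assumption: evenly spaced scores from 1.0 (best) down to 0.0 (worst).
    return (len(ranking) - 1 - position) / (len(ranking) - 1)


ranking = ["agent-a", "agent-b", "agent-c"]
print([borda_score(ranking, set(ranking), agent) for agent in ranking])
```
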
## Multi-Model Consensus
Use `models` only when you need cross-model agreement. Multiple models require `consensus`.

```yaml
llm_judges:
  - mode: rubric
    key: quality
    models:
      - claude-sonnet-4-6
      - gpt-4o
    consensus:
      aggregation: median
      min_agreement_threshold: 0.6
      flag_on_disagreement: true
    rubric: Rate overall quality 1-5.
```

Consensus rules:

- `aggregation`: `median`, `mean`, `majority_vote`, or `unanimous`.
- `median` and `mean` are valid only for numeric modes: `rubric`, `reference`, and `n_wise`.
- `majority_vote` is valid only for `assertion`.
- `unanimous` is valid for numeric and assertion modes.
- `min_agreement_threshold` must be between `0` and `1`.
- `flag_on_disagreement` is optional boolean.

Single-model judges must not include `consensus`.

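The consensus rules translate into a small lookup table. The sketch below is a local pre-check under two assumptions: the `0..1` threshold range is inclusive, and a missing `min_agreement_threshold` is simply skipped. `consensus_errors` and its messages are hypothetical names.

```python
NUMERIC_MODES = {"rubric", "reference", "n_wise"}
ALLOWED_MODES = {
    "median": NUMERIC_MODES,
    "mean": NUMERIC_MODES,
    "majority_vote": {"assertion"},
    "unanimous": NUMERIC_MODES | {"assertion"},
}


def consensus_errors(judge: dict) -> list[str]:
    """Check one llm_judges[] entry against the documented consensus rules."""
    errors = []
    models = judge.get("models") or []
    consensus = judge.get("consensus")
    if len(models) > 1 and consensus is None:
        errors.append("consensus is required when models has more than one entry")
    if len(models) <= 1 and consensus is not None:
        errors.append("consensus is only valid with more than one model")
    if consensus is not None:
        aggregation = consensus.get("aggregation")
        if aggregation not in ALLOWED_MODES:
            errors.append(f"unknown aggregation: {aggregation!r}")
        elif judge.get("mode") not in ALLOWED_MODES[aggregation]:
            errors.append(f"{aggregation} is not valid for mode {judge.get('mode')}")
        threshold = consensus.get("min_agreement_threshold")
        # Assumption: bounds are inclusive.
        if threshold is not None and not 0 <= threshold <= 1:
            errors.append("min_agreement_threshold must be between 0 and 1")
    return errors


judge = {
    "mode": "rubric",
    "models": ["claude-sonnet-4-6", "gpt-4o"],
    "consensus": {"aggregation": "median", "min_agreement_threshold": 0.6},
}
print(consensus_errors(judge))
```
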
## Judge Limits
Judge limits live under `scorecard.judge_limits`.

```yaml
scorecard:
  judge_limits:
    max_samples_per_judge: 3
    max_calls_usd: 2.50
    max_tokens: 50000
```

Validation rules:

- `max_samples_per_judge`: `0..10`; `0` defers to each judge's own `samples` value or default.
- `max_calls_usd`: greater than or equal to `0`.
- `max_tokens`: greater than or equal to `0`.

Current pack validation range-checks these budget knobs. Still keep each judge's `samples` and `models` small: each combination of judge, sample, and model can produce one LLM call, so total calls grow as the product of judges, samples, and models.

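A quick way to sanity-check fan-out before a run is to estimate the call count per pack. This sketch assumes that a nonzero `max_samples_per_judge` acts as a cap on each judge's samples and that every sample issues one call per model; treat the result as a rough upper bound. The helper name is hypothetical.

```python
def estimated_judge_calls(
    judges: list[dict], max_samples_per_judge: int = 0, default_samples: int = 3
) -> int:
    """Estimate LLM calls per evaluated run agent as samples times models, per judge."""
    total = 0
    for judge in judges:
        # samples: 0 (or omitted) falls back to the default, currently 3.
        samples = judge.get("samples", 0) or default_samples
        if max_samples_per_judge > 0:
            # Assumption: the limit caps each judge's effective samples.
            samples = min(samples, max_samples_per_judge)
        model_count = len(judge.get("models") or [judge.get("model")])
        total += samples * model_count
    return total


judges = [
    {"key": "helpfulness", "model": "gpt-4o", "samples": 3},
    {"key": "quality", "models": ["claude-sonnet-4-6", "gpt-4o"]},
]
print(estimated_judge_calls(judges))
print(estimated_judge_calls(judges, max_samples_per_judge=2))
```
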
## Scorecard Wiring
Each judge-backed dimension must point at exactly one judge key.

```yaml
scorecard:
  strategy: hybrid
  dimensions:
    - key: deterministic_correctness
      source: validators
      validators:
        - mentions_policy
      weight: 0.5
    - key: groundedness
      source: llm_judge
      judge_key: no_hallucination
      better_direction: higher
      gate: true
      pass_threshold: 1.0
    - key: helpfulness
      source: llm_judge
      judge_key: helpfulness
      weight: 0.5
```

Rules:

- `source: llm_judge` requires `judge_key`.
- `judge_key` must reference an existing `llm_judges[].key`.
- `judge_key` must be empty for non-`llm_judge` dimensions.
- `better_direction`, when present for `llm_judge`, must be `higher`.
- All current judge modes produce numeric normalized scores that can feed dimensions.
- `strategy: hybrid` requires at least one gated dimension.

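The wiring rules lend themselves to a local lint pass over the parsed scorecard. The following is an illustrative sketch of those rules only; `wiring_errors` and its message strings are hypothetical, and the hosted validator remains the source of truth.

```python
def wiring_errors(scorecard: dict, judge_keys: set[str]) -> list[str]:
    """Check scorecard dimensions against the documented judge wiring rules."""
    errors = []
    dimensions = scorecard.get("dimensions", [])
    for dim in dimensions:
        key, judge_key = dim.get("key"), dim.get("judge_key")
        if dim.get("source") == "llm_judge":
            if not judge_key:
                errors.append(f"{key}: source llm_judge requires judge_key")
            elif judge_key not in judge_keys:
                errors.append(f"{key}: unknown judge_key {judge_key!r}")
            if dim.get("better_direction", "higher") != "higher":
                errors.append(f"{key}: better_direction must be higher")
        elif judge_key:
            errors.append(f"{key}: judge_key is only allowed on llm_judge dimensions")
    if scorecard.get("strategy") == "hybrid" and not any(d.get("gate") for d in dimensions):
        errors.append("strategy hybrid requires at least one gated dimension")
    return errors


scorecard = {
    "strategy": "hybrid",
    "dimensions": [
        {"key": "deterministic_correctness", "source": "validators", "validators": ["mentions_policy"]},
        {"key": "groundedness", "source": "llm_judge", "judge_key": "no_hallucination", "gate": True},
        {"key": "helpfulness", "source": "llm_judge", "judge_key": "helpfulness"},
    ],
}
print(wiring_errors(scorecard, {"no_hallucination", "helpfulness"}))
```
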
## Abstention And Unavailable Results
There is no `abstention_rule`, `abstain`, or `unable_to_judge` field in the YAML authoring surface.

Judges become unavailable when:

- `context_from` evidence cannot resolve.
- `reference_from` evidence cannot resolve.
- an `n_wise` run has fewer than two run agents.
- all model calls fail or return unparsable output.
- no provider client or credential path is configured for the judge model.

An unavailable judge result has no `normalized_score`; it keeps a `reason` and payload details such as failed calls or `unable_to_judge_count`. A `source: llm_judge` dimension with no normalized score becomes unavailable.

## Result Shape
Judge-backed scorecards and replay surfaces can include `llm_judge_results` with:

- `judge_key`
- `mode`
- `normalized_score`
- `payload`
- `confidence`
- `variance`
- `sample_count`
- `model_count`
- `reason`

`payload` includes call records, model scores, aggregated score, warnings, and n-wise candidates when applicable.

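When interpreting results, the key distinction is whether `normalized_score` is present. A minimal sketch follows, assuming each `llm_judge_results` entry is a flat object with the fields listed above; the value types and the sample entries are illustrative guesses, and `describe_result` is a hypothetical helper.

```python
def describe_result(result: dict) -> str:
    """Summarize one llm_judge_results entry, treating a missing score as unavailable."""
    key = result.get("judge_key", "<unknown>")
    score = result.get("normalized_score")
    if score is None:
        # Unavailable judges keep a reason instead of a score.
        return f"{key}: unavailable ({result.get('reason', 'no reason given')})"
    return (
        f"{key}: {score:.2f} "
        f"(confidence={result.get('confidence')}, "
        f"samples={result.get('sample_count')}, models={result.get('model_count')})"
    )


# Hypothetical entries, not captured from a real scorecard.
scored = {"judge_key": "helpfulness", "mode": "rubric", "normalized_score": 0.75,
          "confidence": "high", "sample_count": 3, "model_count": 1}
missing = {"judge_key": "summary_quality", "mode": "reference",
           "reason": "reference_from could not be resolved"}
print(describe_result(scored))
print(describe_result(missing))
```
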
## Security And Anti-Gaming
- Do not put raw secrets in `rubric`, `assertion`, `prompt`, `anti_gaming_clauses`, `context_from`, `reference_from`, or `literal:` values.
- Validation rejects `${secrets.*}` references in `rubric`, `assertion`, and `prompt`; avoid them everywhere else too.
- `anti_gaming_clauses` are additive. The evaluator still injects built-in judge instructions and default anti-gaming behavior.
- Keep evidence narrow. Judges receive resolved context text in provider requests.
- Prefer deterministic validators for hard safety, schema, file, and policy constraints; judges should score nuance.

## Common Validation Failures
- `judge_mode: deterministic` includes `llm_judges`.
- `judge_mode: llm_judge` or `hybrid` has no `llm_judges`.
- A judge `key` is empty, duplicated, or collides with a validator/metric key.
- `mode` is not `rubric`, `assertion`, `reference`, or `n_wise`.
- `rubric` mode omits `rubric`.
- `reference` mode omits `rubric` or `reference_from`.
- `assertion` mode omits `assertion`.
- `n_wise` mode omits `prompt`.
- Both `model` and `models` are set, or neither is set.
- `samples` is negative or greater than `10`.
- `models` has more than one model but no `consensus`.
- `consensus` appears on a single-model judge.
- `context_from` or `reference_from` is not a supported evidence reference.
- `score_scale.min >= score_scale.max`.
- `timeout_ms <= 0`.
- `scorecard.judge_limits` values are out of range.
- `source: llm_judge` dimension omits `judge_key` or references an unknown judge.

## Authoring Procedure
1. Keep deterministic validators for hard checks.
2. Choose the judge mode: `rubric`, `assertion`, `reference`, or `n_wise`.
3. Pick the smallest useful `context_from` evidence set.
4. Write rubric/assertion/prompt text that tells the judge how to use only that evidence.
5. Select one `model`, or use `models` plus `consensus` only when cross-model agreement matters.
6. Keep `samples` low; use `scorecard.judge_limits` for budget guardrails.
7. Wire each judge to one `source: llm_judge` scorecard dimension with `judge_key`.
8. Pair hard gates with deterministic validators or assertion judges.
9. Validate with `agentclash challenge-pack validate path/to/pack.yaml --json`.
10. Report judge keys, modes, evidence refs, model fan-out, budget settings, and any unavailable-evidence risks.

## Report Back Format
```text
Judge mode:
Judges:
- key:
  mode:
  model/models:
  samples:
  context_from:
  reference_from:
  consensus:
  score_scale:
  anti_gaming_clauses:
Scorecard dimensions:
Validator pairings:
Judge limits:
Unavailable-evidence risks:
Security review:
Validation command:
Validation result:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-scoring-validators`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`