Challenge Pack Scoring Validators Skill
Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators/SKILL.md
Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators
Full SKILL.md
---
name: agentclash-challenge-pack-scoring-validators
description: Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
metadata:
  agentclash.role: challenge-pack-scoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Scoring Validators

## Purpose
Design deterministic scoring that is valid, explainable, and stable enough for CI, regression, and benchmark comparisons.

Use deterministic validators when objective evidence can prove the behavior. Reach for LLM judges only when the output truly needs subjective or rubric-based assessment.

## Use When
- A pack needs `version.evaluation_spec.validators`.
- Scoring can use exact text, JSON, numeric, math, file, directory, or code-execution evidence.
- A scorecard dimension should average one or more validator results.
- A pack needs numeric run metrics such as latency, token count, tool calls, cost, or validator pass rate.
- A reviewer needs to understand why a validator passed, failed, errored, or was unavailable.

## Do Not Use When
- The challenge, cases, or artifacts are still undefined; use the planner, input-sets, and artifacts skills first.
- The evaluation needs rubric, assertion, n_wise, or reference judging; use `agentclash-challenge-pack-llm-judges`.
- The task is publishing or running an already authored pack; use validation/publish or eval-runner skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` calls the hosted API and requires auth plus a workspace. Use `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml` before validating.

## Validation Commands
Validate after changing validators, metrics, dimensions, strategies, file captures, or evidence references.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Evaluation Spec Shape
Deterministic scoring lives under `version.evaluation_spec`.

```yaml
version:
  evaluation_spec:
    name: support-answer-scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_window
        type: contains
        target: final_output
        expected_from: literal:30 days
    metrics:
      - key: latency_ms
        type: numeric
        collector: run_total_latency_ms
        unit: ms
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          validators:
            - mentions_refund_window
          weight: 1
```

Required fields:

- `name`: required non-empty string.
- `version_number`: required integer greater than `0`.
- `judge_mode`: `deterministic`, `llm_judge`, or `hybrid`; use `deterministic` for this skill.
- `validators`: required and must contain at least one validator.
- `scorecard.dimensions`: required and must contain at least one dimension.

Optional scoring sections used by this skill:

- `metrics`: run metric declarations.
- `post_execution_checks`: file or directory capture declarations used by file validators.
- `runtime_limits`, `pricing`, and `behavioral` exist, but use focused skills unless they are directly needed for scoring.

## Validator Fields
Every validator has this source-backed shape:

```yaml
validators:
  - key: stable_validator_key
    type: exact_match
    target: final_output
    expected_from: literal:approved
    config: {}
```

Fields:

- `key`: required, trimmed, and unique.
- `type`: required and must be one of the supported validator types below.
- `target`: required supported evidence reference.
- `expected_from`: required for most validators; omitted for `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`, `tool_call_assertion`, and `postcondition`.
- `config`: optional JSON/YAML object interpreted by the validator type.

There is no validator-level `failure_message`, `pass_message`, or custom result text field. Results emit `state`, `verdict`, `normalized_score`, `reason`, `raw_output`, `target`, `expected_from`, `actual_value`, and `expected_value`; the reason text is produced by the scorer.

## Supported Validator Types
These are the exact validator type strings accepted by the scoring model:

```text
exact_match
contains
regex_match
json_schema
json_path_match
boolean_assert
fuzzy_match
numeric_match
normalized_match
token_f1
math_equivalence
bleu_score
rouge_score
chrf_score
file_content_match
file_exists
file_json_schema
directory_structure
code_execution
tool_call_assertion
postcondition
```

Do not use `has_json`, `json_equals`, `semantic_match`, `unit_test`, `shell`, or provider-specific names; the validator rejects unknown `type` values.

## Evidence References
Validator `target` and required `expected_from` values must use supported evidence references:

- `final_output`
- `run.final_output`
- `challenge_input`
- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`
- `artifact.<artifact_key>[.<field>]`
- `file:<post_execution_check_key>`
- `literal:<value>`
- `tool_calls` (only for `tool_call_assertion`)

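As a rough illustration of how the reference forms above combine, the sketch below pairs `final_output` with case-level evidence. The keys `answer` and `customer_id` are hypothetical; substitute the expectation and input keys your cases actually define.

```yaml
validators:
  # Compare the agent's final output with a per-case expectation.
  - key: answer_matches_expectation
    type: exact_match
    target: final_output
    expected_from: case.expectations.answer   # hypothetical expectation key

  # Check that the output repeats a value taken from the case inputs.
  - key: echoes_customer_id
    type: contains
    target: final_output
    expected_from: case.inputs.customer_id    # hypothetical input key
```

Reading expected values from case evidence keeps a single validator reusable across every case in the input set.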
## Postconditions
Use `postcondition` to score post-run captured files or directory listings without shelling out through `code_execution`. It must use `target: file:<post_execution_check_key>` and omit `expected_from`. Its strict `config.condition` supports `exists`, `not_exists`, `contains`, `not_contains`, `regex_match`, `json_path_match`, and `equals`.

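A minimal sketch of a `postcondition` validator, reusing the `generated_summary` capture key from the file examples in this skill. It assumes `config.condition` accepts one of the listed condition names as a plain string; the exact shape of `config`, including how `contains` or `equals` receive their operand, is not described here, so confirm it with `agentclash challenge-pack validate` before relying on it.

```yaml
post_execution_checks:
  - key: generated_summary
    type: file_capture
    path: /workspace/summary.json
validators:
  # Targets the captured file and sets no expected_from.
  - key: summary_was_written
    type: postcondition
    target: file:generated_summary
    config:
      condition: exists   # assumption: condition is a plain string
```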
## Tool Call Assertions
Use `tool_call_assertion` to score executed tool-call traces without asking the final answer to self-report behavior. It must use `target: tool_calls` and does not use `expected_from`.

```yaml
validators:
  - key: submitted_answer
    type: tool_call_assertion
    target: tool_calls
    config:
      tool_name: submit
      must_call: true
      arguments_contain:
        answer: "42"
```

Config supports `tool_name`, `must_call`, `count`, `min_count`, `max_count`, `arguments_contain`, `ordered_tools`, and `order_mode`. `order_mode` is `subsequence` by default and can be `exact`. Scorecard evidence includes counts, matched indices, and tool names, but not raw tool arguments.

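The remaining config fields might be combined as in the sketch below. It assumes `ordered_tools` is a list of tool names and that `min_count` and `max_count` are integers; the `search` tool name is hypothetical.

```yaml
validators:
  # Ordering check: `search` should appear before `submit` in the trace.
  - key: searched_before_submitting
    type: tool_call_assertion
    target: tool_calls
    config:
      ordered_tools:          # assumption: a list of tool names
        - search
        - submit
      order_mode: subsequence # the default; `exact` is the other option

  # Count check: bound how often one tool is called.
  - key: bounded_search_calls
    type: tool_call_assertion
    target: tool_calls
    config:
      tool_name: search
      min_count: 1
      max_count: 5
```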
Use `literal:` for inline expected values. Use `case.expectations.<key>` or `artifact.<artifact_key>.path` when the expected value should come from case evidence rather than the skill text.

## Common Text And JSON Validators
Use these when final output or case evidence is already text or JSON.

```yaml
validators:
  - key: exact_decision
    type: exact_match
    target: case.payload.expected_decision
    expected_from: literal:approve

  - key: contains_policy_term
    type: contains
    target: final_output
    expected_from: literal:refund window

  - key: matches_ticket_pattern
    type: regex_match
    target: final_output
    expected_from: literal:TICKET-[0-9]+

  - key: response_is_schema_valid
    type: json_schema
    target: final_output
    expected_from: 'literal:{"type":"object","required":["decision"],"properties":{"decision":{"type":"string"}}}'

  - key: decision_is_approved
    type: json_path_match
    target: final_output
    expected_from: 'literal:{"path":"$.decision","comparator":"equals","value":"approve"}'

  - key: escalation_flag
    type: boolean_assert
    target: case.payload.should_escalate
    expected_from: literal:true
```

`json_path_match` expected values are either a JSON object with `path`, optional `comparator`, and optional `value`, or a path string that starts with `$` for an existence check. Supported comparators are `equals`, `contains`, `greater_than`, `less_than`, and `exists`.

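The two `json_path_match` forms described above could look like the following sketch. The `$.confidence` field is hypothetical, and the bare-path form assumes the path string is passed through `literal:` like any other inline expected value.

```yaml
validators:
  # Existence check: a bare path string starting with `$`.
  - key: decision_field_present
    type: json_path_match
    target: final_output
    expected_from: 'literal:$.decision'

  # Object form with a numeric comparator.
  - key: confidence_above_half
    type: json_path_match
    target: final_output
    expected_from: 'literal:{"path":"$.confidence","comparator":"greater_than","value":0.5}'
```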
## Similarity, Numeric, And Math Validators
These validators accept typed `config` fields.

```yaml
validators:
  - key: answer_fuzzy
    type: fuzzy_match
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.85
      case_insensitive: true
      normalize: true

  - key: total_matches
    type: numeric_match
    target: case.payload.agent_total
    expected_from: case.expectations.expected_total
    config:
      absolute_tolerance: 0.01
      extract_number: true

  - key: normalized_phrase
    type: normalized_match
    target: final_output
    expected_from: literal:refund window is 30 days
    config:
      pipeline:
        - trim
        - lowercase
        - collapse_whitespace

  - key: token_overlap
    type: token_f1
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.75
      normalize: true
      remove_articles: true
      remove_punctuation: true

  - key: formula_equivalent
    type: math_equivalence
    target: final_output
    expected_from: literal:x^2 + 2*x + 1
    config:
      comparison_mode: symbolic
```

Source-backed config notes:

- `fuzzy_match.threshold` and `token_f1.threshold` must be between `0` and `1` when set.
- `numeric_match` accepts `absolute_tolerance`, `relative_tolerance`, `extract_number`, `significant_digits`, `tolerance_mode`, and `tolerance`; tolerances must be non-negative, and `significant_digits` must be greater than `0` when set.
- `normalized_match.pipeline` accepts `trim`, `lowercase`, `collapse_whitespace`, `strip_punctuation`, `strip_currency`, `strip_formatting`, `normalize_unicode`, `remove_articles`, `sort_words`, and `sort_lines`.
- `math_equivalence.comparison_mode` must be `symbolic` or `numeric`; `tolerance` must be non-negative.

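The config notes above mention options the earlier example does not exercise. A hedged sketch of two of them follows; how `relative_tolerance` is interpreted (for instance, as a fraction of the expected value) and how `numeric` comparison samples the expression are not stated here, so the values are placeholders rather than recommendations.

```yaml
validators:
  # numeric_match with a relative rather than absolute tolerance.
  - key: total_relative_tolerance
    type: numeric_match
    target: final_output
    expected_from: case.expectations.expected_total
    config:
      relative_tolerance: 0.01   # must be non-negative
      extract_number: true

  # math_equivalence in numeric mode with an explicit tolerance.
  - key: formula_numeric
    type: math_equivalence
    target: final_output
    expected_from: literal:x^2 + 2*x + 1
    config:
      comparison_mode: numeric
      tolerance: 0.000001        # must be non-negative
```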
## Generation-Style Validators
Use these for text similarity against references when exact wording is not required.

```yaml
validators:
  - key: bleu_reference_overlap
    type: bleu_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.4
      max_ngram: 4
      smoothing: method1

  - key: rouge_summary_overlap
    type: rouge_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      variant: rouge-l

  - key: chrf_summary_overlap
    type: chrf_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      char_order: 6
```

Config validation:

- `bleu_score.smoothing` must be `none` or `method1`; `max_ngram` must be greater than `0`.
- `rouge_score.variant` must be `rouge-1`, `rouge-2`, or `rouge-l`; `beta` must be greater than `0` when set.
- `chrf_score.char_order` and `chrf_score.beta` must be greater than `0` when set.

## File And Directory Validators
File validators must use `target: file:<post_execution_check_key>`. Declare the capture first with `version.evaluation_spec.post_execution_checks`.

```yaml
version:
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - file
      - build
  evaluation_spec:
    name: file-scoring
    version_number: 1
    judge_mode: deterministic
    post_execution_checks:
      - key: generated_summary
        type: file_capture
        path: /workspace/summary.json
      - key: project_listing
        type: directory_listing
        path: /workspace
        recursive: true
    validators:
      - key: summary_exists
        type: file_exists
        target: file:generated_summary
      - key: summary_matches_schema
        type: file_json_schema
        target: file:generated_summary
        config:
          schema:
            type: object
            required:
              - decision
      - key: no_secret_file
        type: directory_structure
        target: file:project_listing
        config:
          forbidden_files:
            - .env
      - key: summary_mentions_decision
        type: file_content_match
        target: file:generated_summary
        expected_from: literal:decision
        config:
          match_mode: contains
```

File validator rules:

- `file_content_match` requires `expected_from` and supports `match_mode`: `exact`, `contains`, `regex`, `not_contains`, or `json_equal`; default is `contains`.
- `file_exists` defaults to `must_exist: true`; set `config.must_exist: false` when the file must be absent.
- `file_json_schema` requires `config.schema`.
- `directory_structure` requires config and supports `required_files`, `forbidden_files`, and `required_directories`.
- If any validator target starts with `file:` and checks are declared, the key must match a `post_execution_checks[].key`.

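Two of the rules above, the absence check and the required-entry lists, might be expressed as in this sketch. The `debug_log` capture and the `tests` directory are hypothetical, and it is an assumption that entries in `required_files` and `required_directories` are written relative to the listed directory.

```yaml
post_execution_checks:
  - key: debug_log
    type: file_capture
    path: /workspace/debug.log
  - key: project_listing
    type: directory_listing
    path: /workspace
    recursive: true
validators:
  # Passes only when the captured file is absent.
  - key: no_debug_log
    type: file_exists
    target: file:debug_log
    config:
      must_exist: false

  # Requires specific entries in the directory listing.
  - key: has_expected_layout
    type: directory_structure
    target: file:project_listing
    config:
      required_files:
        - summary.json   # assumption: relative to the listed path
      required_directories:
        - tests
```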
## Code Execution Validator
`code_execution` is a file validator. Its `target` must reference a `file_capture`, not a `directory_listing`, and `config.test_command` is required.

```yaml
post_execution_checks:
  - key: generated_code
    type: file_capture
    path: /workspace/app.py
validators:
  - key: generated_code_tests
    type: code_execution
    target: file:generated_code
    config:
      test_command: python -m pytest tests/ -q
      timeout_ms: 30000
      scoring: fraction_passed
      pass_threshold: 0.8
```

Source-backed config:

- `test_command`: required non-empty string.
- `timeout_ms`: optional integer greater than `0`.
- `scoring`: `fraction_passed` or `all_or_nothing`; `pass_at_k` is defined but currently rejected.
- `pass_threshold`: optional number between `0` and `1`; default effective threshold is `1.0`.

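For a hard requirement, the other accepted scoring mode can be used with the optional fields left out. This variant is a sketch that reuses the `generated_code` capture from the example above; with `pass_threshold` omitted, the effective threshold stays at the documented default of `1.0`.

```yaml
validators:
  - key: generated_code_all_tests
    type: code_execution
    target: file:generated_code
    config:
      test_command: python -m pytest tests/ -q
      scoring: all_or_nothing   # pass_at_k is currently rejected
```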
## Metrics
Metrics have `key`, `type`, `collector`, and optional `unit`.

```yaml
metrics:
  - key: latency_ms
    type: numeric
    collector: run_total_latency_ms
    unit: ms
  - key: validator_rate
    type: numeric
    collector: validator_pass_rate
```

Metric `type` must be `numeric`, `text`, or `boolean`. The schema accepts `text`, but the current implemented collectors produce numeric or boolean values. The scorer currently implements these collectors:

```text
run_total_latency_ms
run_ttft_ms
run_input_tokens
run_output_tokens
run_total_tokens
run_tool_call_count
run_agent_tokens
run_race_context_tokens
run_model_cost_usd
run_completed_successfully
run_failure_count
behavioral_recovery_score
behavioral_exploration_efficiency_score
behavioral_error_cascade_score
behavioral_scope_adherence_score
behavioral_confidence_calibration_score
validator_pass_rate
```

Validation rejects `behavioral_confidence_calibration_score` for metrics until confidence reporting lands, even though the engine has a collector branch. Avoid it in new packs.

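A sketch of two further declarations drawn from the collector list above. It assumes `run_completed_successfully` is one of the collectors that yields a boolean and that `unit` is a free-form label; neither detail is confirmed here.

```yaml
metrics:
  - key: cost_usd
    type: numeric
    collector: run_model_cost_usd
    unit: usd                 # assumption: unit is a free-form label
  - key: completed
    type: boolean             # assumption: this collector yields a boolean
    collector: run_completed_successfully
```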
## Scorecard Dimensions
Use object-form dimensions for source-fidelity and explicit routing.

```yaml
scorecard:
  strategy: weighted
  pass_threshold: 0.8
  dimensions:
    - key: correctness
      source: validators
      validators:
        - mentions_refund_window
        - decision_is_approved
      weight: 0.8
      gate: true
      pass_threshold: 0.9
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      weight: 0.2
```

Dimension fields:

- `key`: required and unique.
- `source`: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, or `llm_judge`.
- `validators`: optional list of validator keys when `source: validators`; omitted means average all validators.
- `metric`: required when `source: metric` and must reference `metrics[].key`.
- `better_direction`: required for `metric`, `latency`, and `cost`; must be `higher` or `lower`.
- `normalization.target` and `normalization.max`: required for `metric`, `latency`, and `cost`.
- `weight`: optional and must be greater than or equal to `0`.
- `gate`: optional boolean.
- `pass_threshold`: required when `gate: true` or when `strategy: binary`; must be between `0` and `1`.
- `judge_key`: only valid when `source: llm_judge`; use the LLM judges skill for that path.

Strategy rules:

- Missing `strategy` defaults to `weighted`.
- `weighted`: optional scorecard-level `pass_threshold`; explicit gates are allowed.
- `binary`: every dimension is implicitly gated, every dimension needs `pass_threshold`, and scorecard-level `pass_threshold` must not be set.
- `hybrid`: requires at least one `gate: true`; gates must pass and the non-gate weighted average must clear any scorecard-level threshold.

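The `binary` and `hybrid` rules above can be sketched as two alternative scorecards, shown here as separate YAML documents. Threshold and weight values are illustrative only, and the validator and metric keys are borrowed from the earlier examples in this skill.

```yaml
# binary: each dimension carries its own pass_threshold;
# no scorecard-level pass_threshold is set.
scorecard:
  strategy: binary
  dimensions:
    - key: correctness
      source: validators
      validators:
        - mentions_refund_window
      pass_threshold: 1.0
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      pass_threshold: 0.5
---
# hybrid: at least one explicit gate, plus a scorecard-level
# threshold applied to the non-gate weighted average.
scorecard:
  strategy: hybrid
  pass_threshold: 0.7
  dimensions:
    - key: correctness
      source: validators
      gate: true
      pass_threshold: 0.9
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      weight: 1
```

Running `agentclash challenge-pack validate` on either variant is the quickest way to confirm the threshold placement is accepted.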
## Result Interpretation
Validator results can be:

- `verdict: pass`: evidence was available and the validator condition passed.
- `verdict: fail`: evidence was available and the condition failed.
- `verdict: error`: evidence existed but parsing, config, regex, schema, JSONPath, or execution-result interpretation errored.
- unavailable state with no verdict: target or expected evidence could not be resolved.

Each available validator contributes `normalized_score` on a `0..1` scale. `source: validators` dimensions average the scoped validator scores; if scoped validators are unavailable, the dimension is unavailable.

## Common Validation And Scoring Failures
- `validators` is empty.
- Duplicate validator `key`.
- Unknown validator type such as `has_json`.
- Missing `target`.
- Missing `expected_from` for a validator that requires it.
- `target` or `expected_from` is not a supported evidence reference.
- A file validator targets `final_output` instead of `file:<post_execution_check_key>`.
- A `file:` target references a missing `post_execution_checks` key.
- `code_execution` targets a `directory_listing`.
- `file_json_schema` omits `config.schema`; this becomes a scoring error if it slips past pack validation.
- `directory_structure` omits `config`; this becomes a scoring error if it slips past pack validation.
- `code_execution` omits `config.test_command`; validation catches this when config is present, and scoring cannot produce a useful result without it.
- `metric` dimensions omit `normalization`.
- `binary` strategy sets scorecard-level `pass_threshold`.
- `hybrid` strategy has no gated dimension.
- A non-`llm_judge` dimension includes `judge_key`.

## Authoring Procedure
1. Identify the evidence source for each behavior: final output, case payload, case input, expectation, artifact metadata, captured file, or literal.
2. Pick the simplest supported validator type that proves the claim.
3. Add `expected_from` unless the validator type explicitly does not require it.
4. Keep file checks under `post_execution_checks` and target them with `file:<key>`.
5. Group validators into scorecard dimensions with `source: validators`.
6. Add numeric metrics only when the scorecard needs latency, cost, token, tool, completion, failure, or pass-rate signals.
7. Add gates and pass thresholds only for hard requirements.
8. Run `agentclash challenge-pack validate path/to/pack.yaml --json` and fix every returned field error.
9. Report which validators are scored, which are gates, and which evidence refs each one reads.

## Safety
- Do not put secrets in `literal:` expected values, captured files, artifact metadata, or validator config.
- Keep `file_capture` paths narrow; captured content becomes scoring evidence.
- Prefer deterministic fixture expectations over live mutable data.
- Use regex and JSONPath carefully so failures explain behavior instead of implementation trivia.
- Avoid scoring on private customer data unless retention and access are approved.

## Report Back Format
```text
Evaluation spec:
Validator summary:
- key:
  type:
  target:
  expected_from:
  config:
  score dimension:
  gate: <yes/no>
Metrics:
Scorecard:
- strategy:
- dimensions:
File captures:
Evidence references:
Validation command:
Validation result:
Expected result fields:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`