Agent Skills

Challenge Pack Scoring Validators Skill

Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.

Canonical source: web/content/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators/SKILL.md

Markdown export: /docs-md/agent-skills/challenge-pack-skills/agentclash-challenge-pack-scoring-validators

Full SKILL.md

---
name: agentclash-challenge-pack-scoring-validators
description: Use when defining deterministic AgentClash scoring validators, scorecard dimensions, evidence sources, pass/fail rules, numeric metrics, file checks, and validator result interpretation.
metadata:
  agentclash.role: challenge-pack-scoring
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Challenge Pack Scoring Validators

## Purpose
Design deterministic scoring that is valid, explainable, and stable enough for CI, regression, and benchmark comparisons.

Use deterministic validators when objective evidence can prove the behavior. Reach for LLM judges only when the output truly needs subjective or rubric-based assessment.

## Use When
- A pack needs `version.evaluation_spec.validators`.
- Scoring can use exact text, JSON, numeric, math, file, directory, or code-execution evidence.
- A scorecard dimension should average one or more validator results.
- A pack needs numeric run metrics such as latency, token count, tool calls, cost, or validator pass rate.
- A reviewer needs to understand why a validator passed, failed, errored, or was unavailable.

## Do Not Use When
- The challenge, cases, or artifacts are still undefined; use the planner, input-sets, and artifacts skills first.
- The evaluation needs rubric, assertion, n_wise, or reference judging; use `agentclash-challenge-pack-llm-judges`.
- The task is publishing or running an already authored pack; use validation/publish or eval-runner skills.

## Environment
Use hosted production for CLI examples unless the user intentionally targets a local or self-hosted backend.

```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
```

`agentclash challenge-pack validate` calls the hosted API and requires auth plus a workspace. Use `agentclash link`, `--workspace`, `AGENTCLASH_WORKSPACE`, or `.agentclash.yaml` before validating.

## Validation Commands
Validate after changing validators, metrics, dimensions, strategies, file captures, or evidence references.

```bash
agentclash challenge-pack validate path/to/pack.yaml
agentclash challenge-pack validate path/to/pack.yaml --json
```

Human output prints `Challenge pack is valid` or `Challenge pack has errors`. Use `--json` for structured `valid` and `errors` fields.

## Evaluation Spec Shape
Deterministic scoring lives under `version.evaluation_spec`.

```yaml
version:
  evaluation_spec:
    name: support-answer-scoring
    version_number: 1
    judge_mode: deterministic
    validators:
      - key: mentions_refund_window
        type: contains
        target: final_output
        expected_from: literal:30 days
    metrics:
      - key: latency_ms
        type: numeric
        collector: run_total_latency_ms
        unit: ms
    scorecard:
      strategy: weighted
      dimensions:
        - key: correctness
          source: validators
          validators:
            - mentions_refund_window
          weight: 1
```

Required fields:

- `name`: required non-empty string.
- `version_number`: required integer greater than `0`.
- `judge_mode`: `deterministic`, `llm_judge`, or `hybrid`; use `deterministic` for this skill.
- `validators`: required and must contain at least one validator.
- `scorecard.dimensions`: required and must contain at least one dimension.

Optional scoring sections used by this skill:

- `metrics`: run metric declarations.
- `post_execution_checks`: file or directory capture declarations used by file validators.
- `runtime_limits`, `pricing`, and `behavioral` exist, but use focused skills unless they are directly needed for scoring.

## Validator Fields
Every validator has this source-backed shape:

```yaml
validators:
  - key: stable_validator_key
    type: exact_match
    target: final_output
    expected_from: literal:approved
    config: {}
```

Fields:

- `key`: required, trimmed, and unique.
- `type`: required and must be one of the supported validator types below.
- `target`: required supported evidence reference.
- `expected_from`: required for most validators; omitted for `file_exists`, `file_json_schema`, `directory_structure`, `code_execution`, `tool_call_assertion`, and `postcondition`.
- `config`: optional JSON/YAML object interpreted by the validator type.

There is no validator-level `failure_message`, `pass_message`, or custom result text field. Results emit `state`, `verdict`, `normalized_score`, `reason`, `raw_output`, `target`, `expected_from`, `actual_value`, and `expected_value`; the reason text is produced by the scorer.

## Supported Validator Types
These are the exact validator type strings accepted by the scoring model:

```text
exact_match
contains
regex_match
json_schema
json_path_match
boolean_assert
fuzzy_match
numeric_match
normalized_match
token_f1
math_equivalence
bleu_score
rouge_score
chrf_score
file_content_match
file_exists
file_json_schema
directory_structure
code_execution
tool_call_assertion
postcondition
```

Do not use `has_json`, `json_equals`, `semantic_match`, `unit_test`, `shell`, or provider-specific names; the validator rejects unknown `type` values.

## Evidence References
Validator `target` and required `expected_from` values must use supported evidence references:

- `final_output`
- `run.final_output`
- `challenge_input`
- `case.payload`
- `case.payload.<field>`
- `case.inputs.<input_key>`
- `case.expectations.<expectation_key>`
- `artifact.<artifact_key>[.<field>]`
- `file:<post_execution_check_key>`
- `literal:<value>`
- `tool_calls` (only for `tool_call_assertion`)

## Postconditions
Use `postcondition` to score post-run captured files or directory listings without shelling out through `code_execution`. It must use `target: file:<post_execution_check_key>` and omit `expected_from`. Its strict `config.condition` supports `exists`, `not_exists`, `contains`, `not_contains`, `regex_match`, `json_path_match`, and `equals`.

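As a minimal sketch of the rule above, the fragment below pairs a `postcondition` validator with the capture it reads. The keys `deploy_log` and `deploy_log_present` and the file path are hypothetical. Writing the condition as a bare string under `config.condition` is also an assumption; conditions such as `contains` or `equals` presumably need extra config fields that this skill does not document, so only `exists` is shown.

```yaml
post_execution_checks:
  - key: deploy_log              # hypothetical capture key
    type: file_capture
    path: /workspace/deploy.log  # hypothetical path
validators:
  - key: deploy_log_present      # hypothetical validator key
    type: postcondition
    target: file:deploy_log      # must match a post_execution_checks key
    config:
      condition: exists          # assumed scalar form of config.condition
```

Run `agentclash challenge-pack validate` on the pack to confirm the config shape before relying on it.
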
## Tool Call Assertions
Use `tool_call_assertion` to score executed tool-call traces without asking the final answer to self-report behavior. It must use `target: tool_calls` and does not use `expected_from`.

```yaml
validators:
  - key: submitted_answer
    type: tool_call_assertion
    target: tool_calls
    config:
      tool_name: submit
      must_call: true
      arguments_contain:
        answer: "42"
```

Config supports `tool_name`, `must_call`, `count`, `min_count`, `max_count`, `arguments_contain`, `ordered_tools`, and `order_mode`. `order_mode` is `subsequence` by default and can be `exact`. Scorecard evidence includes counts, matched indices, and tool names, but not raw tool arguments.

Use `literal:` for inline expected values. Use `case.expectations.<key>` or `artifact.<artifact_key>.path` when the expected value should come from case evidence rather than the skill text.

## Common Text And JSON Validators
Use these when final output or case evidence is already text or JSON.

```yaml
validators:
  - key: exact_decision
    type: exact_match
    target: case.payload.expected_decision
    expected_from: literal:approve

  - key: contains_policy_term
    type: contains
    target: final_output
    expected_from: literal:refund window

  - key: matches_ticket_pattern
    type: regex_match
    target: final_output
    expected_from: literal:TICKET-[0-9]+

  - key: response_is_schema_valid
    type: json_schema
    target: final_output
    expected_from: 'literal:{"type":"object","required":["decision"],"properties":{"decision":{"type":"string"}}}'

  - key: decision_is_approved
    type: json_path_match
    target: final_output
    expected_from: 'literal:{"path":"$.decision","comparator":"equals","value":"approve"}'

  - key: escalation_flag
    type: boolean_assert
    target: case.payload.should_escalate
    expected_from: literal:true
```

`json_path_match` expected values are either a JSON object with `path`, optional `comparator`, and optional `value`, or a path string that starts with `$` for an existence check. Supported comparators are `equals`, `contains`, `greater_than`, `less_than`, and `exists`.

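The two expected-value forms for `json_path_match` might look like the sketch below. The validator keys and the `$.confidence` field are illustrative and do not come from any real pack.

```yaml
validators:
  # Path-string form: an existence check on $.decision.
  - key: decision_present          # hypothetical key
    type: json_path_match
    target: final_output
    expected_from: literal:$.decision

  # Object form with a numeric comparator.
  - key: confidence_above_half     # hypothetical key
    type: json_path_match
    target: final_output
    expected_from: 'literal:{"path":"$.confidence","comparator":"greater_than","value":0.5}'
```

Quoting the object form keeps the embedded JSON from being parsed as YAML flow syntax.
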
## Similarity, Numeric, And Math Validators
These validators accept typed `config` fields.

```yaml
validators:
  - key: answer_fuzzy
    type: fuzzy_match
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.85
      case_insensitive: true
      normalize: true

  - key: total_matches
    type: numeric_match
    target: case.payload.agent_total
    expected_from: case.expectations.expected_total
    config:
      absolute_tolerance: 0.01
      extract_number: true

  - key: normalized_phrase
    type: normalized_match
    target: final_output
    expected_from: literal:refund window is 30 days
    config:
      pipeline:
        - trim
        - lowercase
        - collapse_whitespace

  - key: token_overlap
    type: token_f1
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.75
      normalize: true
      remove_articles: true
      remove_punctuation: true

  - key: formula_equivalent
    type: math_equivalence
    target: final_output
    expected_from: literal:x^2 + 2*x + 1
    config:
      comparison_mode: symbolic
```

Source-backed config notes:

- `fuzzy_match.threshold` and `token_f1.threshold` must be between `0` and `1` when set.
- `numeric_match` accepts `absolute_tolerance`, `relative_tolerance`, `extract_number`, `significant_digits`, `tolerance_mode`, and `tolerance`; tolerances must be non-negative, and `significant_digits` must be greater than `0` when set.
- `normalized_match.pipeline` accepts `trim`, `lowercase`, `collapse_whitespace`, `strip_punctuation`, `strip_currency`, `strip_formatting`, `normalize_unicode`, `remove_articles`, `sort_words`, and `sort_lines`.
- `math_equivalence.comparison_mode` must be `symbolic` or `numeric`; `tolerance` must be non-negative.

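The config notes above list several fields the main example does not exercise. The sketch below uses a few of them; all validator keys and expected values are hypothetical, and the precise effect of each tolerance or pipeline step is not documented in this skill, so treat the combination as something to confirm with `agentclash challenge-pack validate` and a test run.

```yaml
validators:
  - key: total_within_relative_band   # hypothetical key
    type: numeric_match
    target: final_output
    expected_from: case.expectations.expected_total
    config:
      relative_tolerance: 0.05        # must be non-negative
      extract_number: true

  - key: price_ignoring_currency      # hypothetical key
    type: normalized_match
    target: final_output
    expected_from: literal:1299
    config:
      pipeline:
        - trim
        - strip_currency
        - strip_formatting

  - key: root_numeric                 # hypothetical key
    type: math_equivalence
    target: final_output
    expected_from: literal:1.4142
    config:
      comparison_mode: numeric
      tolerance: 0.001                # must be non-negative
```
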
## Generation-Style Validators
Use these for text similarity against references when exact wording is not required.

```yaml
validators:
  - key: bleu_reference_overlap
    type: bleu_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.4
      max_ngram: 4
      smoothing: method1

  - key: rouge_summary_overlap
    type: rouge_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      variant: rouge-l

  - key: chrf_summary_overlap
    type: chrf_score
    target: final_output
    expected_from: case.expectations.answer
    config:
      threshold: 0.5
      char_order: 6
```

Config validation:

- `bleu_score.smoothing` must be `none` or `method1`; `max_ngram` must be greater than `0`.
- `rouge_score.variant` must be `rouge-1`, `rouge-2`, or `rouge-l`; `beta` must be greater than `0` when set.
- `chrf_score.char_order` and `chrf_score.beta` must be greater than `0` when set.

## File And Directory Validators
File validators must use `target: file:<post_execution_check_key>`. Declare the capture first with `version.evaluation_spec.post_execution_checks`.

```yaml
version:
  execution_mode: native
  tool_policy:
    allowed_tool_kinds:
      - file
      - build
  evaluation_spec:
    name: file-scoring
    version_number: 1
    judge_mode: deterministic
    post_execution_checks:
      - key: generated_summary
        type: file_capture
        path: /workspace/summary.json
      - key: project_listing
        type: directory_listing
        path: /workspace
        recursive: true
    validators:
      - key: summary_exists
        type: file_exists
        target: file:generated_summary
      - key: summary_matches_schema
        type: file_json_schema
        target: file:generated_summary
        config:
          schema:
            type: object
            required:
              - decision
      - key: no_secret_file
        type: directory_structure
        target: file:project_listing
        config:
          forbidden_files:
            - .env
      - key: summary_mentions_decision
        type: file_content_match
        target: file:generated_summary
        expected_from: literal:decision
        config:
          match_mode: contains
```

File validator rules:

- `file_content_match` requires `expected_from` and supports `match_mode`: `exact`, `contains`, `regex`, `not_contains`, or `json_equal`; default is `contains`.
- `file_exists` defaults to `must_exist: true`; set `config.must_exist: false` when the file must be absent.
- `file_json_schema` requires `config.schema`.
- `directory_structure` requires config and supports `required_files`, `forbidden_files`, and `required_directories`.
- If any validator target starts with `file:` and checks are declared, the key must match a `post_execution_checks[].key`.

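The rules above also cover an absence check, required paths, and JSON equality, none of which appear in the main example. A sketch under stated assumptions: every key, path, and the `case.expectations.report` reference is hypothetical, and the entries under `required_files` and `required_directories` are assumed to be written relative to the listed directory, following the `.env` entry in the example above.

```yaml
post_execution_checks:
  - key: workspace_listing           # hypothetical keys and paths throughout
    type: directory_listing
    path: /workspace
    recursive: true
  - key: scratch_file
    type: file_capture
    path: /workspace/tmp/scratch.txt
  - key: report_json
    type: file_capture
    path: /workspace/report.json
validators:
  - key: scratch_file_removed
    type: file_exists
    target: file:scratch_file
    config:
      must_exist: false              # passes only when the file is absent
  - key: layout_is_complete
    type: directory_structure
    target: file:workspace_listing
    config:
      required_files:
        - report.json                # assumed relative to the listing path
      required_directories:
        - src
  - key: report_equals_expected
    type: file_content_match
    target: file:report_json
    expected_from: case.expectations.report
    config:
      match_mode: json_equal
```
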
## Code Execution Validator
`code_execution` is a file validator. Its `target` must reference a `file_capture`, not a `directory_listing`, and `config.test_command` is required.

```yaml
post_execution_checks:
  - key: generated_code
    type: file_capture
    path: /workspace/app.py
validators:
  - key: generated_code_tests
    type: code_execution
    target: file:generated_code
    config:
      test_command: python -m pytest tests/ -q
      timeout_ms: 30000
      scoring: fraction_passed
      pass_threshold: 0.8
```

Source-backed config:

- `test_command`: required non-empty string.
- `timeout_ms`: optional integer greater than `0`.
- `scoring`: `fraction_passed` or `all_or_nothing`; `pass_at_k` is defined but currently rejected.
- `pass_threshold`: optional number between `0` and `1`; default effective threshold is `1.0`.

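For a hard requirement where partial credit is not meaningful, the `all_or_nothing` mode listed above may be the better fit. The variant below is a sketch; the key names, file path, and test command are hypothetical. `pass_threshold` is left out, so the documented default effective threshold of `1.0` applies.

```yaml
post_execution_checks:
  - key: solution_file               # hypothetical capture key and path
    type: file_capture
    path: /workspace/solution.py
validators:
  - key: solution_tests_all_pass     # hypothetical validator key
    type: code_execution
    target: file:solution_file       # must be a file_capture, not a directory_listing
    config:
      test_command: python -m pytest tests/test_solution.py -q
      timeout_ms: 60000
      scoring: all_or_nothing
```
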
## Metrics
Metrics have `key`, `type`, `collector`, and optional `unit`.

```yaml
metrics:
  - key: latency_ms
    type: numeric
    collector: run_total_latency_ms
    unit: ms
  - key: validator_rate
    type: numeric
    collector: validator_pass_rate
```

Metric `type` must be `numeric`, `text`, or `boolean`. The schema accepts `text`, but the current implemented collectors produce numeric or boolean values. The scorer currently implements these collectors:

```text
run_total_latency_ms
run_ttft_ms
run_input_tokens
run_output_tokens
run_total_tokens
run_tool_call_count
run_agent_tokens
run_race_context_tokens
run_model_cost_usd
run_completed_successfully
run_failure_count
behavioral_recovery_score
behavioral_exploration_efficiency_score
behavioral_error_cascade_score
behavioral_scope_adherence_score
behavioral_confidence_calibration_score
validator_pass_rate
```

Validation rejects `behavioral_confidence_calibration_score` for metrics until confidence reporting lands, even though the engine has a collector branch. Avoid it in new packs.

## Scorecard Dimensions
Use object-form dimensions for source-fidelity and explicit routing.

```yaml
scorecard:
  strategy: weighted
  pass_threshold: 0.8
  dimensions:
    - key: correctness
      source: validators
      validators:
        - mentions_refund_window
        - decision_is_approved
      weight: 0.8
      gate: true
      pass_threshold: 0.9
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      weight: 0.2
```

Dimension fields:

- `key`: required and unique.
- `source`: `validators`, `metric`, `reliability`, `latency`, `cost`, `behavioral`, or `llm_judge`.
- `validators`: optional list of validator keys when `source: validators`; omitted means average all validators.
- `metric`: required when `source: metric` and must reference `metrics[].key`.
- `better_direction`: required for `metric`, `latency`, and `cost`; must be `higher` or `lower`.
- `normalization.target` and `normalization.max`: required for `metric`, `latency`, and `cost`.
- `weight`: optional and must be greater than or equal to `0`.
- `gate`: optional boolean.
- `pass_threshold`: required when `gate: true` or when `strategy: binary`; must be between `0` and `1`.
- `judge_key`: only valid when `source: llm_judge`; use the LLM judges skill for that path.

Strategy rules:

- Missing `strategy` defaults to `weighted`.
- `weighted`: optional scorecard-level `pass_threshold`; explicit gates are allowed.
- `binary`: every dimension is implicitly gated, every dimension needs `pass_threshold`, and scorecard-level `pass_threshold` must not be set.
- `hybrid`: requires at least one `gate: true`; gates must pass and the non-gate weighted average must clear any scorecard-level threshold.

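The `binary` and `hybrid` rules above could be expressed as the two alternative scorecards sketched below, separated by a YAML document marker. The dimension keys and threshold values are illustrative; the validator and metric keys reuse names from earlier examples in this skill and must exist in the same `evaluation_spec`.

```yaml
# Binary: every dimension carries its own pass_threshold,
# and no scorecard-level pass_threshold is set.
scorecard:
  strategy: binary
  dimensions:
    - key: correctness
      source: validators
      validators:
        - mentions_refund_window
      pass_threshold: 1.0
    - key: format
      source: validators
      validators:
        - response_is_schema_valid
      pass_threshold: 1.0
---
# Hybrid: at least one gated dimension, plus a scorecard-level
# threshold that the non-gate weighted average must clear.
scorecard:
  strategy: hybrid
  pass_threshold: 0.7
  dimensions:
    - key: correctness
      source: validators
      validators:
        - decision_is_approved
      gate: true
      pass_threshold: 0.9
    - key: speed
      source: metric
      metric: latency_ms
      better_direction: lower
      normalization:
        target: 1000
        max: 60000
      weight: 1
```
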
## Result Interpretation
Validator results can be:

- `verdict: pass`: evidence was available and the validator condition passed.
- `verdict: fail`: evidence was available and the condition failed.
- `verdict: error`: evidence existed but parsing, config, regex, schema, JSONPath, or execution-result interpretation errored.
- unavailable state with no verdict: target or expected evidence could not be resolved.

Each available validator contributes `normalized_score` on a `0..1` scale. `source: validators` dimensions average the scoped validator scores; if scoped validators are unavailable, the dimension is unavailable.

## Common Validation And Scoring Failures
- `validators` is empty.
- Duplicate validator `key`.
- Unknown validator type such as `has_json`.
- Missing `target`.
- Missing `expected_from` for a validator that requires it.
- `target` or `expected_from` is not a supported evidence reference.
- A file validator targets `final_output` instead of `file:<post_execution_check_key>`.
- A `file:` target references a missing `post_execution_checks` key.
- `code_execution` targets a `directory_listing`.
- `file_json_schema` omits `config.schema`; this becomes a scoring error if it slips past pack validation.
- `directory_structure` omits `config`; this becomes a scoring error if it slips past pack validation.
- `code_execution` omits `config.test_command`; validation catches this when config is present, and scoring cannot produce a useful result without it.
- `metric` dimensions omit `normalization`.
- `binary` strategy sets scorecard-level `pass_threshold`.
- `hybrid` strategy has no gated dimension.
- A non-`llm_judge` dimension includes `judge_key`.

## Authoring Procedure
1. Identify the evidence source for each behavior: final output, case payload, case input, expectation, artifact metadata, captured file, or literal.
2. Pick the simplest supported validator type that proves the claim.
3. Add `expected_from` unless the validator type explicitly does not require it.
4. Keep file checks under `post_execution_checks` and target them with `file:<key>`.
5. Group validators into scorecard dimensions with `source: validators`.
6. Add numeric metrics only when the scorecard needs latency, cost, token, tool, completion, failure, or pass-rate signals.
7. Add gates and pass thresholds only for hard requirements.
8. Run `agentclash challenge-pack validate path/to/pack.yaml --json` and fix every returned field error.
9. Report which validators are scored, which are gates, and which evidence refs each one reads.

## Safety
- Do not put secrets in `literal:` expected values, captured files, artifact metadata, or validator config.
- Keep `file_capture` paths narrow; captured content becomes scoring evidence.
- Prefer deterministic fixture expectations over live mutable data.
- Use regex and JSONPath carefully so failures explain behavior instead of implementation trivia.
- Avoid scoring on private customer data unless retention and access are approved.

## Report Back Format
```text
Evaluation spec:
Validator summary:
- key:
  type:
  target:
  expected_from:
  config:
  score dimension:
  gate: <yes/no>
Metrics:
Scorecard:
- strategy:
- dimensions:
File captures:
Evidence references:
Validation command:
Validation result:
Expected result fields:
Open issues:
```

## Related Skills
- `agentclash-challenge-pack-input-sets`
- `agentclash-challenge-pack-artifacts`
- `agentclash-challenge-pack-tools-sandbox`
- `agentclash-challenge-pack-llm-judges`
- `agentclash-challenge-pack-validation-publish`
- `agentclash-eval-runner`
- `agentclash-scorecard-reader`