Agent Skills

Prompt Eval Playground Skill

Use when scaffolding, validating, or running prompt eval YAML configs and managing playground experiments, test cases, and prompt variants via the AgentClash CLI.

Canonical source: web/content/agent-skills/agentclash-prompt-eval-playground/SKILL.md

Markdown export: /docs-md/agent-skills/agentclash-prompt-eval-playground

Use This Skill When

Use when scaffolding, validating, or running prompt eval YAML configs and managing playground experiments, test cases, and prompt variants via the AgentClash CLI.

Full SKILL.md

---
name: agentclash-prompt-eval-playground
description: Use when scaffolding, validating, or running prompt eval YAML configs and managing playground experiments, test cases, and prompt variants via the AgentClash CLI.
metadata:
  agentclash.role: prompt-eval
  agentclash.version: "1"
  agentclash.requires_cli: "true"
---

# AgentClash Prompt Eval Playground

## Purpose
Local-first prompt evaluation workflows: scaffold `.agentclash/prompt-eval.yaml`, validate locally or remotely, compile configs into playground experiments, fetch results, import Promptfoo subsets, and manage playground CRUD from the CLI.

## Use When
- Comparing prompt variants or model aliases on fixed test cases before a full challenge-pack run.
- CI needs `prompt-eval validate --ci --remote` or `prompt-eval run --ci --follow`.
- Migrating a Promptfoo config into AgentClash format.
- Inspecting or rerunning playground experiments linked to prompt eval runs.

## Do Not Use When
- The eval is a full agent deployment on a challenge pack; use `agentclash-eval-runner`.
- The workflow is dataset versioning and baseline gates; use `agentclash-dataset-workflows`.
- The user needs harness repo tasks; use `agentclash-agent-harness-setup`.

## Inputs Needed
- Workspace with provider accounts and deployments referenced in the YAML.
- Prompt eval config path (default `.agentclash/prompt-eval.yaml`).
- For playground commands: playground ID from list/create output.
- Optional Promptfoo YAML for import.

## Environment
```bash
export AGENTCLASH_API_URL="https://api.agentclash.dev"
agentclash workspace use <WORKSPACE_ID>
agentclash prompt-eval init
agentclash prompt-eval validate --remote
```

## Procedure
1. Scaffold or import a prompt eval config.
2. Validate locally; add `--remote` (and `--ci` in pipelines) for workspace-safe checks.
3. Run to compile and launch playground experiments; use `--follow` in CI.
4. Fetch results by experiment ID; compare assertion pass rates vs threshold.
5. Use `playground` subcommands for manual experiment CRUD when not driven by YAML.

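Steps 2 and 3 of the procedure can be chained into a single CI gate. The following is a minimal sketch, not an official script: the function name `prompt_eval_gate` and the `AGENTCLASH_BIN` override variable are hypothetical helpers introduced here so the CLI can be stubbed in tests. The subcommands and flags are the ones listed in this skill.

```bash
#!/usr/bin/env bash
# Hypothetical CI gate: validate remotely, then run and wait for completion.
# AGENTCLASH_BIN is an assumed override (not a CLI feature) used for stubbing.
AGENTCLASH_BIN="${AGENTCLASH_BIN:-agentclash}"

prompt_eval_gate() {
  local threshold="${1:-0.95}"

  # Step 2: validate with --remote and --ci together.
  "$AGENTCLASH_BIN" prompt-eval validate --remote --ci || return 1

  # Step 3: compile and launch experiments, blocking until they complete.
  "$AGENTCLASH_BIN" prompt-eval run --follow --ci --threshold "$threshold"
}
```

Calling `prompt_eval_gate 0.9` as a pipeline step returns the exit status of the last CLI call, so the job fails when validation fails or the run ends non-zero.
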
## Commands

### Prompt eval lifecycle
```bash
agentclash prompt-eval init
agentclash prompt-eval init my-eval.yaml --name "Refund prompts"
agentclash prompt-eval validate
agentclash prompt-eval validate --remote --ci
agentclash prompt-eval run --follow --ci --threshold 0.95
agentclash prompt-eval results <experiment-id> --threshold 0.95
agentclash prompt-eval import-promptfoo promptfoo.yaml --out .agentclash/prompt-eval.yaml
```

Useful flags on `run`: `--max-cases`, `--poll-interval`, `--timeout`, `--threshold`.

### Playground CRUD (alias `pg`)
```bash
agentclash playground list
agentclash playground create --name "Refund A/B" --description "Tone variants"
agentclash playground get <playground-id>
agentclash playground update <playground-id> --name "Updated name"
agentclash playground delete <playground-id>

agentclash playground test-cases list <playground-id>
agentclash playground test-cases create <playground-id> --input '{"messages":[...]}'
agentclash playground test-cases update <playground-id> <case-id> --expected-file out.json
agentclash playground test-cases delete <playground-id> <case-id>

agentclash playground experiments list <playground-id>
agentclash playground experiments create <playground-id> --prompt-variant <variant-id>
agentclash playground experiments get <playground-id> <experiment-id>
agentclash playground experiments run <playground-id> <experiment-id> --follow
```

Config default path: `.agentclash/prompt-eval.yaml`. Schema version is stamped on `init`.

## Expected Output
- Validate prints errors/warnings; exits non-zero when invalid.
- Run compiles cases into experiments; `--follow` waits for completion.
- Results envelope includes assertion pass rate; sub-threshold exits non-zero in CI mode.

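The threshold check can be reproduced locally when post-processing a pass rate. This is a sketch under two assumptions: the rate and threshold are fractions between 0 and 1, and a rate equal to the threshold counts as passing (the CLI's exact comparison is not stated here). `meets_threshold` is a hypothetical helper, not part of the CLI.

```bash
# Hypothetical helper: succeed (exit 0) when rate >= threshold, fail otherwise.
# awk handles the floating-point comparison that plain [ ... ] cannot.
meets_threshold() {
  local rate="$1" threshold="$2"
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 >= t + 0) }'
}
```

For example, `meets_threshold 0.97 0.95 && echo "gate passed"` prints the message, while a rate of `0.90` against the same threshold returns non-zero.
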
## Failure Modes
- `--ci` without `--remote` on validate → CI mode requires the remote checks; pass both flags.
- Missing workspace references in YAML → fix provider accounts/deployments or run `validate --remote`.
- Promptfoo import with unsupported features → use `--lossy` or edit converted YAML manually.
- Experiment not found → list experiments on the playground first.

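The first failure mode can be caught before the CLI is invoked. Below is a minimal sketch of a pre-flight check for wrapper scripts; `check_validate_flags` is a hypothetical helper and the error message is illustrative, not the CLI's own output.

```bash
# Hypothetical pre-flight check: reject --ci when --remote is absent.
check_validate_flags() {
  local has_ci=0 has_remote=0 arg
  for arg in "$@"; do
    case "$arg" in
      --ci) has_ci=1 ;;
      --remote) has_remote=1 ;;
    esac
  done

  if [ "$has_ci" -eq 1 ] && [ "$has_remote" -eq 0 ]; then
    echo "validate: --ci requires --remote" >&2
    return 1
  fi
  return 0
}
```

A wrapper could call `check_validate_flags "$@" && agentclash prompt-eval validate "$@"` so that a misconfigured pipeline fails fast with a local message.
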
## Safety Notes
- Remote validate/run touches live provider accounts; use CI workspace tokens, not personal prod keys in shared logs.
- Promptfoo import may drop unsupported assertions; review converted YAML before merging.
- Playground experiments incur model cost; cap `--max-cases` in exploratory runs.

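One way to make the cost cap hard to forget is a small wrapper that always passes `--max-cases`. This sketch assumes `--max-cases` accepts a positive integer count; the default of 5, the `explore_run` name, and the `AGENTCLASH_BIN` override are all illustrative choices, not CLI behaviour.

```bash
# Hypothetical wrapper: exploratory runs always carry a --max-cases cap.
# AGENTCLASH_BIN is an assumed override (not a CLI feature) used for stubbing.
AGENTCLASH_BIN="${AGENTCLASH_BIN:-agentclash}"

explore_run() {
  local max_cases="${1:-5}"

  # Accept digits only, and reject zero, so the cap is always meaningful.
  case "$max_cases" in
    *[!0-9]*)
      echo "max cases must be a positive integer" >&2
      return 2
      ;;
  esac
  if [ "$max_cases" -eq 0 ]; then
    echo "max cases must be a positive integer" >&2
    return 2
  fi

  "$AGENTCLASH_BIN" prompt-eval run --max-cases "$max_cases"
}
```

`explore_run 3` launches a run limited to three cases, and `explore_run` with no argument falls back to the illustrative default.
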
## Report Back Format
```text
Config: <path>
Validate: <pass/fail> (<error count> errors)
Experiment: <id or n/a>
Pass rate: <rate vs threshold>
Playground: <id or n/a>
Next: agentclash eval-runner ... OR prompt-eval results <id>
```

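The report can be emitted from a script so that every run uses the same field order. This is a minimal sketch: `print_report` is a hypothetical helper, and optional fields fall back to `n/a` as in the template.

```bash
# Hypothetical helper: print the report-back block in the template's field order.
# Arguments: config, validate status, error count, then optional
# experiment ID, pass rate, playground ID, and next step.
print_report() {
  local config="$1" validate="$2" errors="$3"
  local experiment="${4:-n/a}" rate="${5:-n/a}"
  local playground="${6:-n/a}" next="${7:-n/a}"

  printf 'Config: %s\n' "$config"
  printf 'Validate: %s (%s errors)\n' "$validate" "$errors"
  printf 'Experiment: %s\n' "$experiment"
  printf 'Pass rate: %s\n' "$rate"
  printf 'Playground: %s\n' "$playground"
  printf 'Next: %s\n' "$next"
}
```

For instance, `print_report .agentclash/prompt-eval.yaml pass 0 exp_123 "0.97 vs 0.95"` fills the last two fields with `n/a`; the ID `exp_123` is made up for illustration.
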
## Related Skills
- `agentclash-hub`
- `agentclash-cli-setup`
- `agentclash-eval-runner`
- `agentclash-dataset-workflows`
- `agentclash-scorecard-reader`
- `agentclash-regression-flywheel`

## Related Docs
- `/docs-md/guides/use-with-ai-tools`
- `/docs-md/reference/cli`
- `/docs-md/getting-started/first-eval`
