Multi-turn challenge packs
Hybrid multi_turn execution with scripted, LLM, and human user-simulator phases, operator APIs, and calibration reviews.
Hybrid multi_turn execution runs a conversation loop outside the native multi-step tool loop. Each case declares a user_simulator manifest with scripted, LLM, and human phases.
Note
Human phases block until an operator submits a turn via API or CLI. The run-agent replay page shows an awaiting-human banner while the workflow waits.
Execution flow
- Worker routes execution_mode: multi_turn to MultiTurnExecutor.
- Phases emit turn.* events (scripted messages, LLM-simulated user, human takeover).
- Human phases block on operator input until timeout.
- Scoring builds a transcript from events and evaluates recovery_behavior plus optional human_preference from arena votes.
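The flow above can be sketched as a small Python loop. This is an illustration only, not the MultiTurnExecutor implementation: the phase dictionary shape, the event type names (`turn.scripted`, `turn.llm`, `turn.human`, `turn.agent`), the `Event` class, and the choice to end a human phase quietly on timeout are all assumptions made for the example.

```python
import queue
from dataclasses import dataclass


@dataclass
class Event:
    type: str        # hypothetical event names under the turn.* prefix
    turn_index: int
    role: str
    text: str


def user_messages(phase, events, llm_user, human_inbox, timeout_s):
    """Yield user messages for one phase, lazily, one turn at a time."""
    kind = phase["kind"]
    if kind == "scripted":
        yield from phase["messages"]
    elif kind == "llm":
        for _ in range(phase["turns"]):
            # The simulated user sees every event emitted so far.
            yield llm_user(events)
    elif kind == "human":
        for _ in range(phase["turns"]):
            try:
                # Block until an operator submits a turn or the timeout hits.
                yield human_inbox.get(timeout=timeout_s)
            except queue.Empty:
                return  # assumption: a timeout simply ends the human phase
    else:
        raise ValueError(f"unknown phase kind: {kind}")


def run_phases(phases, agent, llm_user, human_inbox, timeout_s=1.0):
    """Run phases in order and return the emitted events."""
    events = []
    turn = 0
    for phase in phases:
        for text in user_messages(phase, events, llm_user, human_inbox, timeout_s):
            events.append(Event(f"turn.{phase['kind']}", turn, "user", text))
            events.append(Event("turn.agent", turn, "assistant", agent(text)))
            turn += 1
    return events


def build_transcript(events):
    """Flatten events into (role, text) pairs ordered by turn."""
    ordered = sorted(events, key=lambda e: e.turn_index)  # stable sort
    return [(e.role, e.text) for e in ordered]
```

Here the human inbox is an in-process queue standing in for the operator API; in the real system the message would arrive through the turn submission endpoint or CLI.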
Operator APIs
| Method | Path | Purpose |
|---|---|---|
| POST | /v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turns | Submit human user message |
| GET | /v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turns/status | Poll awaiting-human state |
| POST | /v1/workspaces/{ws}/calibration-reviews | Record H2 calibration score (1–5) |
| GET | /v1/workspaces/{ws}/calibration-reviews | List recent calibration reviews |
| GET | /v1/workspaces/{ws}/arena/tasks | List pending pairwise arena tasks |
| POST | /v1/workspaces/{ws}/arena/votes | Submit arena preference vote |
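As a rough illustration of calling the turn endpoints from a script, the Python sketch below builds the request objects without sending them. Only the paths come from the table above. The base URL, the bearer-token header, and the JSON field names (`message`, `score`) are assumptions for the example and may not match the real request schema.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.example.invalid"  # placeholder host, not a real endpoint


def turns_url(ws, run_id, run_agent_id, status=False):
    """Build the turns (or turns/status) URL from the documented path."""
    path = (
        f"/v1/workspaces/{quote(ws, safe='')}"
        f"/runs/{quote(run_id, safe='')}"
        f"/run-agents/{quote(run_agent_id, safe='')}/turns"
    )
    return BASE_URL + path + ("/status" if status else "")


def submit_turn_request(ws, run_id, run_agent_id, message, token):
    """Prepare a POST that submits a human user message."""
    body = json.dumps({"message": message}).encode("utf-8")  # assumed field name
    return urllib.request.Request(
        turns_url(ws, run_id, run_agent_id),
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


def calibration_review_payload(score):
    """Validate the documented 1-5 range before building a payload."""
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("calibration score must be an integer from 1 to 5")
    return {"score": score}  # assumed field name
```

Passing the prepared request to `urllib.request.urlopen` would send it. Validating the score on the client side keeps out-of-range values from reaching the API at all.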
CLI
export AGENTCLASH_WORKSPACE="<workspace-id>"

# While a run agent is executing and awaiting human input:
agentclash run turn status <runAgentId> --run <runId>
agentclash run turn submit <runAgentId> --run <runId> --message "Fine, email me when it posts."

Reference pack
Publish and run examples/challenge-packs/multi-turn-refund-recovery.yaml for an end-to-end smoke test. The pack demonstrates scripted escalation, LLM user simulation, and a human takeover phase.
Web UI
The run-agent replay page groups steps by turn_index, shows mismatch badges, and surfaces an awaiting-human banner with a submit form while the agent is executing.
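The grouping by turn_index can be pictured with a short Python sketch. The step records here are hypothetical dictionaries; only the turn_index key comes from the text above, and the replay page's actual data model may differ.

```python
from collections import defaultdict


def group_by_turn(steps):
    """Group step records by turn_index, keeping each turn's original order."""
    grouped = defaultdict(list)
    for step in steps:
        grouped[step["turn_index"]].append(step)
    # Return a plain dict ordered by ascending turn_index.
    return dict(sorted(grouped.items()))
```

For example, steps arriving out of order with turn_index values 1, 0, 1 would be returned as turn 0 with one step followed by turn 1 with two steps.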
See also
- Bundle YAML reference: set execution_mode to multi_turn
- Input sets & cases: case payloads and the user_simulator manifest
- Replay and scorecards: read turn-grouped evidence