Challenge packs

Multi-turn challenge packs

Hybrid multi_turn execution with scripted, LLM, and human user-simulator phases, operator APIs, and calibration reviews.

Hybrid multi_turn execution runs a conversation loop outside the native multi-step tool loop. Each case declares a user_simulator manifest with scripted, LLM, and human phases.

Note

Human phases block until an operator submits a turn via API or CLI. The run-agent replay page shows an awaiting-human banner while the workflow waits.

Execution flow

  1. Worker routes execution_mode: multi_turn to MultiTurnExecutor.
  2. Phases emit turn.* events (scripted messages, LLM-simulated user, human takeover).
  3. Human phases block on operator input until timeout.
  4. Scoring builds a transcript from events and evaluates recovery_behavior plus optional human_preference from arena votes.

Operator APIs

MethodPathPurpose
POST/v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turnsSubmit human user message
GET/v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turns/statusPoll awaiting-human state
POST/v1/workspaces/{ws}/calibration-reviewsRecord H2 calibration score (1–5)
GET/v1/workspaces/{ws}/calibration-reviewsList recent calibration reviews
GET/v1/workspaces/{ws}/arena/tasksList pending pairwise arena tasks
POST/v1/workspaces/{ws}/arena/votesSubmit arena preference vote

CLI

bash
1export AGENTCLASH_WORKSPACE="<workspace-id>"
2
3# While a run agent is executing and awaiting human input:
4agentclash run turn status <runAgentId> --run <runId>
5agentclash run turn submit <runAgentId> --run <runId> --message "Fine, email me when it posts."

Reference pack

Publish and run examples/challenge-packs/multi-turn-refund-recovery.yaml for an end-to-end smoke test. The pack demonstrates scripted escalation, LLM user simulation, and a human takeover phase.

Web UI

The run-agent replay page groups steps by turn_index, shows mismatch badges, and surfaces an awaiting-human banner with a submit form while the agent is executing.

See also