Multi-turn challenge packs

Hybrid multi_turn execution with scripted, LLM, and human user-simulator phases, operator APIs, and calibration reviews.

multi_turn is one of the four pack execution_mode values (native, prompt_eval, responses, multi_turn). Hybrid multi_turn execution runs a conversation loop outside the native multi-step tool loop. Each case declares a user_simulator manifest with scripted, LLM, and human phases.

Note

Human phases block until an operator submits a turn via API or CLI. The run-agent replay page shows an awaiting-human banner while the workflow waits.

Execution flow

Packs with execution_mode: multi_turn run the conversation loop, handled by the worker's multi-turn executor, instead of the native tool loop.
Each phase emits turn.* events (scripted messages, LLM-simulated user turns, human takeover).
Human phases block until an operator submits a turn or the phase times out.
Scoring builds a transcript from the events and evaluates recovery_behavior, plus an optional human_preference derived from arena votes.
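
The transcript-building step can be pictured as a fold over the event stream. The sketch below is illustrative only: the source states that phases emit turn.* events and that the replay page groups steps by turn_index, but the concrete type names (`turn.user`, `turn.assistant`), the `actor` and `text` fields, and the `build_transcript` helper are assumptions made for this example, not AgentClash's actual schema.

```python
from collections import defaultdict


def build_transcript(events):
    """Group turn.* events by turn_index and return them in turn order.

    Assumed (hypothetical) event shape:
      {"type": "turn.user", "turn_index": 0, "actor": "scripted", "text": "..."}
    """
    by_turn = defaultdict(list)
    for event in events:
        # Skip anything outside the turn.* namespace.
        if not event["type"].startswith("turn."):
            continue
        by_turn[event["turn_index"]].append(event)

    transcript = []
    for turn_index in sorted(by_turn):
        for event in by_turn[turn_index]:
            transcript.append({
                "turn_index": turn_index,
                "role": event["type"].split(".", 1)[1],  # text after "turn."
                "actor": event.get("actor"),
                "text": event["text"],
            })
    return transcript


# Hypothetical events, including one non-turn event that is ignored.
events = [
    {"type": "turn.user", "turn_index": 0, "actor": "scripted",
     "text": "I need a refund for order 1234 right now."},
    {"type": "turn.assistant", "turn_index": 0,
     "text": "I can help with that refund."},
    {"type": "run.heartbeat"},
    {"type": "turn.user", "turn_index": 1, "actor": "human",
     "text": "Fine, email me when it posts."},
]

transcript = build_transcript(events)
for entry in transcript:
    print(entry["turn_index"], entry["role"], entry["text"])
```

A scorer would then read this ordered list rather than the raw events. How recovery_behavior is actually computed is not described here, so the sketch stops at the transcript.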

Operator APIs

| Method | Path | Purpose |
| --- | --- | --- |
| `POST` | `/v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turns` | Submit human user message |
| `GET` | `/v1/workspaces/{ws}/runs/{runId}/run-agents/{runAgentId}/turns/status` | Poll awaiting-human state |
| `POST` | `/v1/workspaces/{ws}/calibration-reviews` | Record a human calibration score (1–5) for a sampled run |
| `GET` | `/v1/workspaces/{ws}/calibration-reviews` | List recent calibration reviews |
| `GET` | `/v1/workspaces/{ws}/arena/tasks` | List pending pairwise arena tasks |
| `POST` | `/v1/workspaces/{ws}/arena/votes` | Submit arena preference vote |
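
As a rough illustration of how a script might address these endpoints, the sketch below builds (but does not send) requests with Python's standard library. Only the paths and methods come from the table. The base URL, the Bearer-token header, and the JSON field names (`message`, `run_id`, `score`) are assumptions; the `message` field simply mirrors the CLI's `--message` flag. Check the API reference before relying on any of them.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not a real AgentClash URL


def _request(method, path, token, body=None):
    """Build a JSON request without sending it. Bearer auth is an assumption."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def turn_status_request(ws, run_id, run_agent_id, token):
    """GET the awaiting-human state for one run agent."""
    path = f"/v1/workspaces/{ws}/runs/{run_id}/run-agents/{run_agent_id}/turns/status"
    return _request("GET", path, token)


def submit_turn_request(ws, run_id, run_agent_id, message, token):
    """POST a human user message. The "message" field name is a guess."""
    path = f"/v1/workspaces/{ws}/runs/{run_id}/run-agents/{run_agent_id}/turns"
    return _request("POST", path, token, {"message": message})


def calibration_review_request(ws, run_id, score, token):
    """POST a calibration score. "run_id" and "score" are hypothetical fields."""
    if not 1 <= score <= 5:
        raise ValueError("calibration score must be between 1 and 5")
    path = f"/v1/workspaces/{ws}/calibration-reviews"
    return _request("POST", path, token, {"run_id": run_id, "score": score})


req = submit_turn_request(
    "ws-1", "run-1", "agent-1", "Fine, email me when it posts.", "TOKEN"
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)  (needs a real host and token)
```

Keeping request construction separate from sending makes the URL and payload easy to inspect or unit-test without network access.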

CLI

```bash
export AGENTCLASH_WORKSPACE="<workspace-id>"

# While a run agent is executing and awaiting human input:
agentclash run turn status <runAgentId> --run <runId>
agentclash run turn submit <runAgentId> --run <runId> --message "Fine, email me when it posts."
```

Reference pack

Publish and run examples/challenge-packs/multi-turn-refund-recovery.yaml for an end-to-end smoke test. The pack demonstrates scripted escalation, LLM user simulation, and a human takeover phase.

The user_simulator manifest on a case looks like this (lifted from the reference pack):

```yaml
user_simulator:
  schema_version: 1
  kind: hybrid
  max_turns: 12
  phases:
    - id: open
      actor: scripted
      turns:
        - message: "I need a refund for order {{order_id}} right now."
          expects:
            - key: acknowledges_refund
              kind: contains
              value: refund
    - id: pushback
      actor: scripted
      trigger: on_assistant_mismatch
      turns:
        - message: "You quoted the wrong policy. I want a full refund today."
    - id: llm_escalation
      actor: llm
      trigger: on_assistant_mismatch
      persona: "Frustrated customer who will accept a clear timeline if empathetic"
      max_turns: 3
      until:
        - "assistant_emitted:timeline"
    - id: human_takeover
      actor: human
      trigger: manual
      timeout_ms: 900000
  calibration:
    enabled: true
    sample_rate: 0.1
  post_run:
    arena:
      enabled: true
      comparison: pairwise
```
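
A few of the numbers in this manifest are easier to read after a quick sanity check: a timeout_ms of 900000 is 15 minutes, and a sample_rate of 0.1 is a fraction between 0 and 1. The sketch below copies a subset of the manifest into a Python dict (to avoid a YAML dependency) and runs some illustrative checks. The `check_manifest` helper and the rules it applies are assumptions inferred from the example, not documented validation rules.

```python
# Subset of the reference manifest, written as a dict for a dependency-free example.
manifest = {
    "max_turns": 12,
    "phases": [
        {"id": "open", "actor": "scripted"},
        {"id": "pushback", "actor": "scripted", "trigger": "on_assistant_mismatch"},
        {"id": "llm_escalation", "actor": "llm",
         "trigger": "on_assistant_mismatch", "max_turns": 3},
        {"id": "human_takeover", "actor": "human",
         "trigger": "manual", "timeout_ms": 900000},
    ],
    "calibration": {"enabled": True, "sample_rate": 0.1},
}

KNOWN_ACTORS = {"scripted", "llm", "human"}  # the three actors used in the example


def check_manifest(m):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for phase in m["phases"]:
        if phase["actor"] not in KNOWN_ACTORS:
            problems.append(f"{phase['id']}: unknown actor {phase['actor']!r}")
        # Assumed rule: a phase-level turn cap should not exceed the global cap.
        if phase.get("max_turns", 0) > m["max_turns"]:
            problems.append(f"{phase['id']}: max_turns exceeds global max_turns")
    rate = m["calibration"]["sample_rate"]
    # Assumed rule: sample_rate is a fraction of runs, so it must lie in [0, 1].
    if not 0.0 <= rate <= 1.0:
        problems.append(f"calibration.sample_rate out of range: {rate}")
    return problems


problems = check_manifest(manifest)
human_phase = next(p for p in manifest["phases"] if p["actor"] == "human")
timeout_minutes = human_phase["timeout_ms"] / 60000  # milliseconds to minutes

print("problems:", problems)
print("human takeover timeout (minutes):", timeout_minutes)
```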

Web UI

The run-agent replay page groups steps by turn_index, shows mismatch badges, and surfaces an awaiting-human banner with a submit form while the agent is executing.
