May 15 – May 24, 2026

Security packs, multi-turn eval, and replay polish

Security-family challenge packs, stress-run CLI, and vault boundary harnesses established AgentClash as a security eval surface. Multi-turn evaluation with human takeover and case templating extended the engine beyond single-shot runs.

Security eval
Multi-turn eval
Case templating
Replay polish

What shipped

Added

Multi-turn evaluation — user simulators, hybrid executor, human takeover, calibration reviews, and post-run arena.
Case template renderer and bundle validation for code-execution challenge packs.
Multi-turn conversation events with transcript helpers in the event model.

Improved

Replay trace output upgraded with IDE-level syntax highlighting and rich text rendering.
Planted secrets from security packs wired into sandbox provisioning.

Security

SecurityPolicy schema and SecurityScore dimension for security-family challenge packs.
Canonical secret-hygiene and prompt-injection challenge packs.
CLI `stress-run` subcommand with Anthropic Messages provider and `--no-system-guard` leak surfacing.
`agent-vault-stress` harness with real Vault SDK, function calling, campaign mode, and bundled HTTP mock.
Infisical and HashiCorp Vault boundary packs for vault-framed canary leak testing.
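
For readers new to canary leak testing, the idea behind the planted secrets and boundary packs above can be sketched roughly as follows: plant a unique marker value where the agent can reach it, then scan the run transcript for that value. This is a minimal illustration, not AgentClash's implementation. The names `Canary`, `plant_canary`, and `find_leaks` are hypothetical, and the base64 check is an assumed example of catching one trivial encoding.

```python
import base64
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    """A planted secret; both fields are illustrative, not an AgentClash schema."""
    name: str
    value: str


def plant_canary(name: str) -> Canary:
    # Hypothetical helper: a random value that should never occur by chance.
    return Canary(name=name, value=f"CANARY-{secrets.token_hex(8)}")


def find_leaks(canaries: list[Canary], transcript: list[str]) -> list[str]:
    """Return the names of canaries whose value appears in any transcript turn.

    Checks the raw value and the value base64-encoded on its own. Other
    encodings or partial leaks are not detected by this sketch.
    """
    leaked = []
    for canary in canaries:
        encoded = base64.b64encode(canary.value.encode()).decode()
        if any(canary.value in turn or encoded in turn for turn in transcript):
            leaked.append(canary.name)
    return leaked


if __name__ == "__main__":
    token = plant_canary("vault_token")
    transcript = ["Fetching config...", f"The token is {token.value}"]
    print(find_leaks([token], transcript))
```

A real harness would likely also need to handle split, reversed, or otherwise obfuscated values, which plain substring matching misses.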

Merged pull requests

21 PRs