Security packs, multi-turn eval, and replay polish

Security-family challenge packs, stress-run CLI, and vault boundary harnesses established AgentClash as a security eval surface. Multi-turn evaluation with human takeover and case templating extended the engine beyond single-shot runs.

  • Security eval
  • Multi-turn eval
  • Case templating
  • Replay polish

What shipped

Added

  • Multi-turn evaluation — user simulators, hybrid executor, human takeover, calibration reviews, and post-run arena.
  • Case template renderer and bundle validation for code-execution challenge packs.
  • Multi-turn conversation events with transcript helpers in the event model.

Improved

  • Replay trace output upgraded with IDE-level syntax highlighting and rich text rendering.
  • Planted secrets from security packs wired into sandbox provisioning.

Security

  • SecurityPolicy schema and SecurityScore dimension for security-family challenge packs.
  • Canonical secret-hygiene and prompt-injection challenge packs shipped.
  • CLI `stress-run` subcommand with Anthropic Messages provider and --no-system-guard leak surfacing.
  • agent-vault-stress harness with real Vault SDK, function calling, campaign mode, and bundled HTTP mock.
  • Infisical and HashiCorp Vault boundary packs for vault-framed canary leak testing.

Merged pull requests

21 PRs
  1. #856feat: add opencode PR review bot with intelligent model selection
  2. #855fix(multi-turn): preserve submit tool history and human-input UI
  3. #853feat(runevents): multi-turn run events + transcript model (#842)
  4. #852feat(challengepack): user_simulator manifest schema + validation (#841)
  5. #851feat(challengepack): per-case {{placeholder}} templating for test_command (#840)
  6. #838feat(replay): IDE-level syntax highlighting and rich text rendering in trace output
  7. #832feat(security): runtime-stress harness — real Vault SDK + function calling (#815)
  8. #831docs(security): comprehensive security-eval findings + methodology (#815)
  9. #830fix(packs): anchor forbidden_output patterns to compliance-context (#815)
  10. #829fix(stress): Anthropic empty-content refusal → synthetic refusal marker (#815)
  11. #827feat(packs): hashicorp-vault-boundary.yaml — Vault-framed canary pack (#815)
  12. #826feat(packs): infisical-boundary.yaml — does naming the vault hold the line? (#815)
  13. #825feat(stress): Anthropic Messages API provider + frontier-model leak data (#815)
  14. #824feat(stress): --no-system-guard surfaces real leaks (gpt-4o-mini 100% leak at 15 iter) (#815)
  15. #823test(stress): prove scorer fires per-incident-kind under stub-LLM leak injection (#815)
  16. #822fix(packs): broaden refusal-patterns after real stress-run calibration (#815)
  17. #820feat(cli): security stress-run subcommand (PR 5/10 — #815)
  18. #819feat(examples): prompt-injection-classic canonical pack (PR 4/10 — #815)
  19. #818feat(examples): canonical secret-hygiene-env security pack (PR 3/10 — #815)
  20. #817feat(securityscore): pure scorer for SecurityPolicy (PR 2/10 — #815)
  21. #816feat(challengepack): SecurityPolicy schema for security-family packs (PR 1/10 — #815)
All releases