Scoring depth, failure review, and the first regression suite UI

The evaluation engine gained deterministic validators, LLM judge dimensions, and behavioral scoring. Failure review and regression suites landed in the workspace UI, alongside CLI distribution hardening and the xAI provider adapter.

  • Scoring & validators
  • Failure review
  • Regression suites
  • CLI distribution

What shipped

Added

  • Sandbox code-execution validator and math-equivalence scoring for deterministic checks.
  • LLM judge dimensions with n-wise comparisons, persisted results, and scorecard rationale cards in the UI.
  • Behavioral spec scoring — signal extraction, scorecards, and dimension contribution metadata.
  • Text-generation metrics and token F1 validator with evidence linked back to run events.
  • Failure review API and workspace Failures page with filters, detail drawer, and run links.
  • Regression suite CRUD, failure promotion flow, and full regression UI (suites, cases, promotion dialog).
  • xAI provider adapter wired into the execution engine.
  • Run replay step detail now surfaces model output and tool results.
  • Workspace invite emails via Resend.

Improved

  • Run-agent scorecard redesigned with an inspector-style layout.
  • Cross-platform CLI install scripts hardened for macOS, Linux, and Windows.
  • Authenticated users redirect from the landing page straight to the dashboard.

Merged pull requests

60 PRs
  1. #413[codex] fix workspace auth hydration after login
  2. #412fix(web): redirect workspace shell auth failures
  3. #411fix(auth): recover device login email from workos userinfo
  4. #410[codex] make workspace navigation instant
  5. #409fix(auth): show user identity after cli login
  6. #408feat: race-context — live peer-standings injection (Phase 2 UI)
  7. #405feat(cli): default released binaries to https://api.agentclash.dev
  8. #404(fix): cli release please scope
  9. #403feat: race-context — live peer-standings injection (Phase 1 backend + API + CLI)
  10. #401chore(cli): remove winget release channel
  11. #398Revert "feat(web): first-run onboarding for workspaces" (#392)
  12. #396[codex] Browser capability substrate
  13. #393feat: full screen YAML editor for challenge packs
  14. #392feat(web): first-run onboarding for workspaces
  15. #391feat(web): SEO preview tree at /v2 with 22 marketing pages
  16. #389feat(web): 3D hero icon with WebGL shaders
  17. #387refactor(docs): update documentation UI to match clean dark theme
  18. #386[codex] Build public docs foundation
  19. #385feat(web): add race mode toggle for live run view
  20. #384[codex] Fix live arena SSE and add commentary booth
  21. #383feat(web): live arena view for active runs
  22. #377feat(cli): add interactive run creation picker
  23. #373[codex] Complete issue #149 eval session UI
  24. #371[codex] Implement repeated-eval pass@k and comparison semantics
  25. #370[codex] Add eval session aggregation persistence
  26. #369Add eval-session workflow fan-out orchestration
  27. #368feat(eval-sessions): add repeated-eval inspection reads and verification matrix
  28. #366Add eval session creation endpoint and transactional run plumbing
  29. #364[codex] Add eval session persistence for repeated eval runs
  30. #357Add guided build authoring UX for non-expert users
  31. #356Add phase 1 ranking insights UI and API
  32. #353feat(cli): harden CLI and add npm distribution channel
  33. #344feat(web): frontend compare/gate with regressions (#327)
  34. #343Add regression rule evaluation to release gates
  35. #340Tighten regression workflow gaps before H and I
  36. #339feat: support regression suite selection in runs
  37. #338Regression workflow: add failure promotion modal
  38. #337feat(web): regression suites frontend (closes #323)
  39. #335fix(release): use supported GoReleaser publisher tokens
  40. #334feat: add regression failure promotion endpoint
  41. #333fix(release): use plain v tags for Release Please
  42. #332feat(failures): Run → Failures page (closes #320)
  43. #329Add failure review read model and run failures API
  44. #328Add regression suite and case CRUD backend
  45. #316Add xAI provider support via OpenAI-compatible adapter
  46. #313Make CLI install and release flow accessible
  47. #312fix(scoring): point final_output source at real producer, fix JSON-schema integer rejection
  48. #311Add live CLI E2E smoke suite
  49. #310feat(scorecard): link validators and metrics to originating run event
  50. #309Fix CLI silently swallowing command errors
  51. #308[codex] Expose validator evidence on scorecards
  52. #307Fix CLI browser auth flow
  53. #306[codex] Add token_f1 scoring validator
  54. #305feat(scoring): add BLEU / ROUGE / chrF validators
  55. #304Expose dimension weights and contribution in scorecard response
  56. #303feat(scoring): add behavioral analysis scorecards
  57. #298Support n-wise judges and hosted structured trace replay
  58. #293feat(scoring): LLM judge spec surface + persistence (phases 1-2)
  59. #291fix(api): auto-select input set on run creation
  60. #290fix(auth): restore secure CLI login with web verification
All releases