Scoring depth, failure review, and the first regression suite UI
The evaluation engine gained deterministic validators, LLM judge dimensions, and behavioral scoring. Failure review and regression suites landed in the workspace UI, alongside CLI distribution hardening and the xAI provider adapter.
- Scoring & validators
- Failure review
- Regression suites
- CLI distribution
What shipped
Added
- Sandbox code-execution validator and math-equivalence scoring for deterministic checks.
- LLM judge dimensions with n-wise comparisons, persisted results, and scorecard rationale cards in the UI.
- Behavioral spec scoring — signal extraction, scorecards, and dimension contribution metadata.
- Text-generation metrics (BLEU, ROUGE, chrF) and a token F1 validator, with evidence linked back to run events.
- Failure review API and workspace Failures page with filters, detail drawer, and run links.
- Regression suite CRUD, failure promotion flow, and full regression UI (suites, cases, promotion dialog).
- xAI provider adapter wired into the execution engine.
- Run replay step detail now surfaces model output and tool results.
- Workspace invite emails via Resend.
Improved
- Run-agent scorecard redesigned with an inspector-style layout.
- Cross-platform CLI install scripts hardened for macOS, Linux, and Windows.
- Authenticated users redirect from the landing page straight to the dashboard.
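For readers unfamiliar with the token F1 validator listed under Added, the sketch below shows the usual bag-of-tokens definition (as used in SQuAD-style evaluation). It is an illustration only: the function name `token_f1`, the lowercase-and-split-on-whitespace tokenization, and the handling of empty inputs are assumptions, not the validator that shipped.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a prediction and a reference string.

    Hypothetical helper: tokenization here is lowercase + whitespace split,
    which may differ from the shipped validator's normalization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return 1.0 if pred_tokens == ref_tokens else 0.0

    # Multiset intersection counts each shared token at most
    # min(count in prediction, count in reference) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the prediction "the cat" against the reference "the cat sat" gives precision 1 and recall 2/3, so the F1 is 0.8.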
Merged pull requests
60 PRs
- #413 [codex] fix workspace auth hydration after login
- #412 fix(web): redirect workspace shell auth failures
- #411 fix(auth): recover device login email from workos userinfo
- #410 [codex] make workspace navigation instant
- #409 fix(auth): show user identity after cli login
- #408 feat: race-context — live peer-standings injection (Phase 2 UI)
- #405 feat(cli): default released binaries to https://api.agentclash.dev
- #404 (fix): cli release please scope
- #403 feat: race-context — live peer-standings injection (Phase 1 backend + API + CLI)
- #401 chore(cli): remove winget release channel
- #398 Revert "feat(web): first-run onboarding for workspaces" (#392)
- #396 [codex] Browser capability substrate
- #393 feat: full screen YAML editor for challenge packs
- #392 feat(web): first-run onboarding for workspaces
- #391 feat(web): SEO preview tree at /v2 with 22 marketing pages
- #389 feat(web): 3D hero icon with WebGL shaders
- #387 refactor(docs): update documentation UI to match clean dark theme
- #386 [codex] Build public docs foundation
- #385 feat(web): add race mode toggle for live run view
- #384 [codex] Fix live arena SSE and add commentary booth
- #383 feat(web): live arena view for active runs
- #377 feat(cli): add interactive run creation picker
- #373 [codex] Complete issue #149 eval session UI
- #371 [codex] Implement repeated-eval pass@k and comparison semantics
- #370 [codex] Add eval session aggregation persistence
- #369 Add eval-session workflow fan-out orchestration
- #368 feat(eval-sessions): add repeated-eval inspection reads and verification matrix
- #366 Add eval session creation endpoint and transactional run plumbing
- #364 [codex] Add eval session persistence for repeated eval runs
- #357 Add guided build authoring UX for non-expert users
- #356 Add phase 1 ranking insights UI and API
- #353 feat(cli): harden CLI and add npm distribution channel
- #344 feat(web): frontend compare/gate with regressions (#327)
- #343 Add regression rule evaluation to release gates
- #340 Tighten regression workflow gaps before H and I
- #339 feat: support regression suite selection in runs
- #338 Regression workflow: add failure promotion modal
- #337 feat(web): regression suites frontend (closes #323)
- #335 fix(release): use supported GoReleaser publisher tokens
- #334 feat: add regression failure promotion endpoint
- #333 fix(release): use plain v tags for Release Please
- #332 feat(failures): Run → Failures page (closes #320)
- #329 Add failure review read model and run failures API
- #328 Add regression suite and case CRUD backend
- #316 Add xAI provider support via OpenAI-compatible adapter
- #313 Make CLI install and release flow accessible
- #312 fix(scoring): point final_output source at real producer, fix JSON-schema integer rejection
- #311 Add live CLI E2E smoke suite
- #310 feat(scorecard): link validators and metrics to originating run event
- #309 Fix CLI silently swallowing command errors
- #308 [codex] Expose validator evidence on scorecards
- #307 Fix CLI browser auth flow
- #306 [codex] Add token_f1 scoring validator
- #305 feat(scoring): add BLEU / ROUGE / chrF validators
- #304 Expose dimension weights and contribution in scorecard response
- #303 feat(scoring): add behavioral analysis scorecards
- #298 Support n-wise judges and hosted structured trace replay
- #293 feat(scoring): LLM judge spec surface + persistence (phases 1-2)
- #291 fix(api): auto-select input set on run creation
- #290 fix(auth): restore secure CLI login with web verification
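
PR #371 in the list above refers to pass@k for repeated evals. As background, the sketch below shows the widely used unbiased pass@k estimator: given n attempts of which c passed, it estimates the probability that at least one of k sampled attempts passes. Whether the eval-session implementation uses this exact estimator is an assumption; the function name `pass_at_k` and its argument checks are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total attempts run, c: attempts that passed, k: sample size.
    Hypothetical helper, not taken from the project's codebase.
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")

    # Fewer than k failures exist, so any k-sample must contain a pass.
    if n - c < k:
        return 1.0

    # Probability that all k sampled attempts are failures, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 attempts and 3 passes, `pass_at_k(10, 3, 1)` reduces to the plain pass rate of 0.3, while larger k values credit the agent for succeeding on any one of several tries.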