Apr 15 – Apr 24, 2026

Scoring depth, failure review, and the first regression suite UI

The evaluation engine gained deterministic validators, LLM judge dimensions, and behavioral scoring. Failure review and regression suites landed in the workspace UI, alongside CLI distribution hardening and the xAI provider adapter.

Scoring & validators
Failure review
Regression suites
CLI distribution

What shipped

Added

Sandbox code-execution validator and math-equivalence scoring for deterministic checks.
LLM judge dimensions with n-wise comparisons, persisted results, and scorecard rationale cards in the UI.
Behavioral spec scoring — signal extraction, scorecards, and dimension contribution metadata.
Text-generation metrics and token F1 validator with evidence linked back to run events.
Failure review API and workspace Failures page with filters, detail drawer, and run links.
Regression suite CRUD, failure promotion flow, and full regression UI (suites, cases, promotion dialog).
xAI provider adapter wired into the execution engine.
Run replay step detail now surfaces model output and tool results.
Workspace invite emails via Resend.

Improved

Run-agent scorecard redesigned with an inspector-style layout.
Cross-platform CLI install scripts hardened for macOS, Linux, and Windows.
Authenticated users redirect from the landing page straight to the dashboard.

Merged pull requests

60 PRs