May 05 – May 14, 2026

CI intelligence, prompt eval, and GitHub-native workflows

Failure clusters, regression provenance, and CI setup generators connected the eval loop to GitHub. Prompt eval CLI commands, the PR comment bot, and E2B harness runners closed the gap between local runs and production gates.

CI & GitHub integration
Failure taxonomy
Prompt eval CLI
Harness runners

What shipped

Added

Failure cluster rollups, identity keys, taxonomy classification, and trend charts in the web UI.
Regression provenance, validation signals, proposed-case queue, and remediation hints.
CI setup generators with workspace page and one-click setup pull request creation.
Prompt eval CLI commands — config validation, remote preflight, compile, follow, and GitHub Action mode.
AgentClash PR comment bot with links back to failure review in the UI.
E2B harness runners for Claude Code and OpenClaw agent execution.
Production failure capture and explicit validation runs for proposed regression cases.

Improved

CLI preserves CI curation metadata and surfaces failure taxonomy in workflow output.
Evaluator validity signals exposed in scorecards for clearer trust boundaries.

Fixed

Dozens of CI, regression, and failure-review edge cases hardened across API and web.

Merged pull requests

112 PRs