CI intelligence, prompt eval, and GitHub-native workflows
Failure clusters, regression provenance, and CI setup generators connected the eval loop to GitHub. Prompt eval CLI commands, the PR comment bot, and E2B harness runners closed the gap between local runs and production gates.
- CI & GitHub integration
- Failure taxonomy
- Prompt eval CLI
- Harness runners
What shipped
Added
- Failure cluster rollups, identity keys, taxonomy classification, and trend charts in the web UI.
- Regression provenance, validation signals, proposed-case queue, and remediation hints.
- CI setup generators with workspace page and one-click setup pull request creation.
- Prompt eval CLI commands: config validation, remote preflight, compile, follow, and GitHub Action mode.
- AgentClash PR comment bot with links back to failure review in the UI.
- E2B harness runners for Claude Code and OpenClaw agent execution.
- Production failure capture and explicit validation runs for proposed regression cases.
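To make the idea of failure cluster identity keys and rollups concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not AgentClash's implementation: the record fields (`taxonomy`, `case_id`, `message`), the normalization rule, and the function names `identity_key` and `cluster_failures` are all hypothetical. The point is only that failures differing in volatile details (run ids, durations) should map to one key so that they roll up into a single cluster.

```python
import hashlib
import re
from collections import defaultdict


def identity_key(failure: dict) -> str:
    """Hypothetical identity key: a short hash of taxonomy class, case id, and normalized message."""
    # Replace hex ids and digit runs with a placeholder so volatile details
    # (run numbers, timings) do not split one underlying failure into many clusters.
    message = re.sub(r"0x[0-9a-f]+|\d+", "#", failure["message"].lower())
    raw = "|".join([failure["taxonomy"], failure["case_id"], message])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def cluster_failures(failures: list[dict]) -> dict[str, list[dict]]:
    """Group failure records by identity key."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[identity_key(failure)].append(failure)
    return dict(clusters)


# Made-up sample records, for illustration only.
failures = [
    {"taxonomy": "tool_error", "case_id": "checkout-1", "message": "Timeout after 30s on run 4411"},
    {"taxonomy": "tool_error", "case_id": "checkout-1", "message": "Timeout after 45s on run 4502"},
    {"taxonomy": "wrong_answer", "case_id": "refund-2", "message": "Expected refund of 20"},
]

clusters = cluster_failures(failures)
# Rollup: cluster sizes, largest first.
rollup = sorted((len(members) for members in clusters.values()), reverse=True)
```

In this sketch the two timeout records share a key and form one cluster of two, while the wrong-answer record stands alone. A real system would likely choose its key fields more carefully; the hash truncation here is only for readability.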
Improved
- CLI preserves CI curation metadata and surfaces failure taxonomy in workflow output.
- Evaluator validity signals exposed in scorecards for clearer trust boundaries.
Fixed
- Dozens of CI, regression, and failure-review edge cases hardened across API and web.
Merged pull requests
112 PRs
- #814 Add CLI validation for voice artifact manifests
- #812 Add generic voice artifact manifest schema
- #810 test: guard embedded voice schema drift
- #807 feat: add voice report schema preflight CLI
- #805 feat: add voice report JSON schemas
- #803 docs: add generic voice artifact contracts
- #801 feat: use generic provider log metrics
- #800 feat: support generic voice report types
- #799 fix: cross-check video sync summary counts
- #798 feat: ingest voice video sync reports
- #797 feat: ingest voice live continuity reports
- #795 feat: ingest voice source separation reports
- #790 [codex] score voice media preservation metrics
- #788 feat(voice): add voice eval UI affordances
- #787 [Voice evals 16] Add support-agent end-to-end smoke test
- #786 [Voice evals 15] Add production-call import and redaction contract
- #785 [Voice evals 14] Add live-call adapter interface with fake transport
- #783 [Voice evals 13] Add CLI/API mode selection for voice evals
- #782 feat(voice-evals): add voice compare gate policy
- #781 feat(voice-evals): add voice replay projection
- #780 feat(voice-evals): add deterministic voice scorecards
- #779 feat(voice-evals): add deterministic voice validators and metrics
- #778 feat(voice-evals): add text-sim execution mode
- #777 feat(voice-evals): add fake voice deployment harness
- #776 feat(voice-evals): add scripted user simulator
- #775 feat(voice-evals): define artifact manifest
- #774 feat(voice-evals): add canonical voice events
- #773 test(voice-evals): add golden fixture adapters
- #772 feat(voice-evals): add voice pack validation
- #771 feat(voice-evals): add multimodal trace contract
- #751 feat: add agent build version templates
- #750 feat(cli): add run replay and compare commands
- #749 feat: add embeddable scorecard shares
- #748 test: harden public run share snapshots
- #747 feat: add workspace public pack opt-in
- #746 feat: surface model alias pricing drift
- #745 feat: add postcondition validator
- #744 test: tighten doctor pack lineup assertion
- #743 fix: address doctor pack review feedback
- #742 feat: add pack readiness doctor checks
- #741 feat: add provider account smoke test
- #740 feat: add model alias create flags
- #739 feat: add tool-call assertion validator
- #738 feat: export markdown run transcripts
- #737 feat: filter streamed run events
- #736 feat: add cost-per-correct scorecard metric
- #735 fix: preserve Gemini thought signatures
- #734 feat: add race series aggregate reports
- #733 feat: add race series creation
- #731 feat: add challenge pack deployment lineups
- #730 feat: add seeded run creation
- #729 feat: add run max-iteration overrides
- #728 feat: export run events as JSONL
- #727 feat: surface scorecard total cost
- #726 feat: add workspace quota visibility
- #724 feat: add run cancellation
- #723 fix: log orphan reaper run ids
- #722 fix: reap orphaned queued runs
- #664 Add rich SEO schema and discovery guardrails
- #663 Enrich docs main entity schema
- #662 Enrich blog main entity schema
- #661 Add blog article schema image
- #660 Cover robots crawler policy
- #659 Cover sitemap discovery routes
- #658 Add docs social image alt metadata
- #657 Keep page Open Graph locale explicit
- #656 Add secondary page social metadata
- #655 Add blog index social metadata
- #654 Add platform social metadata
- #653 Add blog social image metadata
- #652 Add docs social metadata
- #651 Add docs structured data
- #650 Add blog post breadcrumb structured data
- #649 Add blog index structured data
- #648 Add RSS autodiscovery metadata
- #647 Add blog RSS feed
- #646 Include blog posts in llms discovery files
- #645 Add AI agent evaluation blog post
- #644 Add platform page structured data
- #643 Add platform pages to docs search index
- #642 List platform pages in llms discovery files
- #641 Add AI agent regression testing page
- #640 Add AI agent evaluation platform page
- #639 Improve npm discovery metadata
- #638 Clarify homepage AI agent evaluation copy
- #637 Noindex markdown docs exports
- #636 Add public page SEO metadata
- #635 Improve homepage SEO metadata signals
- #634 Remove hidden v2 marketing routes
- #630 CLI: expose full Agent Harness workflow
- #629 feat(cli): add workflow phase 1 commands
- #628 Agent Harnesses: classify failures and curate prior runs
- #627 Agent Harnesses: add suite rankings and pass@k
- #626 feat: add harness execution controls
- #625 feat: add agent harness task suites
- #624 feat: add harness privacy controls
- #623 fix: log invite accept denial reasons
- #622 feat: persist harness scorecards
- #621 feat: add agent harness bootstrap setup stage
- #620 fix: accept invite tokens when auth email is unavailable
- #617 fix: repair invite acceptance flow
- #616 feat(agent-harnesses): chat-first harness workbench
- #607 fix(cli): compact prompt eval results tables
- #605 fix(cli): polish prompt eval results table
- #603 docs: design prompt-eval failure promotion
- #602 feat(cli): add safe Promptfoo import subset
- #601 feat(action): add prompt-eval CI mode
- #600 feat(cli): add prompt-eval follow and results
- #599 feat(cli): compile and launch prompt-eval runs
- #598 feat(cli): add prompt-eval remote preflight
- #596 feat(cli): add prompt-eval config validation
- #580 fix: add copyable member invite links