CI intelligence, prompt eval, and GitHub-native workflows

Failure clusters, regression provenance, and CI setup generators connected the eval loop to GitHub. Prompt eval CLI commands, the PR comment bot, and E2B harness runners closed the gap between local runs and production gates.

  • CI & GitHub integration
  • Failure taxonomy
  • Prompt eval CLI
  • Harness runners

What shipped

Added

  • Failure cluster rollups, identity keys, taxonomy classification, and trend charts in the web UI.
  • Regression provenance, validation signals, proposed-case queue, and remediation hints.
  • CI setup generators with workspace page and one-click setup pull request creation.
  • Prompt eval CLI commands: config validation, remote preflight, compile, follow, and GitHub Action mode.
  • AgentClash PR comment bot with links back to failure review in the UI.
  • E2B harness runners for Claude Code and OpenClaw agent execution.
  • Production failure capture and explicit validation runs for proposed regression cases.

Improved

  • CLI preserves CI curation metadata and surfaces failure taxonomy in workflow output.
  • Evaluator validity signals exposed in scorecards for clearer trust boundaries.

Fixed

  • Dozens of CI, regression, and failure-review edge cases hardened across API and web.

Merged pull requests

112 PRs
  1. #814 Add CLI validation for voice artifact manifests
  2. #812 Add generic voice artifact manifest schema
  3. #810 test: guard embedded voice schema drift
  4. #807 feat: add voice report schema preflight CLI
  5. #805 feat: add voice report JSON schemas
  6. #803 docs: add generic voice artifact contracts
  7. #801 feat: use generic provider log metrics
  8. #800 feat: support generic voice report types
  9. #799 fix: cross-check video sync summary counts
  10. #798 feat: ingest voice video sync reports
  11. #797 feat: ingest voice live continuity reports
  12. #795 feat: ingest voice source separation reports
  13. #790 [codex] score voice media preservation metrics
  14. #788 feat(voice): add voice eval UI affordances
  15. #787 [Voice evals 16] Add support-agent end-to-end smoke test
  16. #786 [Voice evals 15] Add production-call import and redaction contract
  17. #785 [Voice evals 14] Add live-call adapter interface with fake transport
  18. #783 [Voice evals 13] Add CLI/API mode selection for voice evals
  19. #782 feat(voice-evals): add voice compare gate policy
  20. #781 feat(voice-evals): add voice replay projection
  21. #780 feat(voice-evals): add deterministic voice scorecards
  22. #779 feat(voice-evals): add deterministic voice validators and metrics
  23. #778 feat(voice-evals): add text-sim execution mode
  24. #777 feat(voice-evals): add fake voice deployment harness
  25. #776 feat(voice-evals): add scripted user simulator
  26. #775 feat(voice-evals): define artifact manifest
  27. #774 feat(voice-evals): add canonical voice events
  28. #773 test(voice-evals): add golden fixture adapters
  29. #772 feat(voice-evals): add voice pack validation
  30. #771 feat(voice-evals): add multimodal trace contract
  31. #751 feat: add agent build version templates
  32. #750 feat(cli): add run replay and compare commands
  33. #749 feat: add embeddable scorecard shares
  34. #748 test: harden public run share snapshots
  35. #747 feat: add workspace public pack opt-in
  36. #746 feat: surface model alias pricing drift
  37. #745 feat: add postcondition validator
  38. #744 test: tighten doctor pack lineup assertion
  39. #743 fix: address doctor pack review feedback
  40. #742 feat: add pack readiness doctor checks
  41. #741 feat: add provider account smoke test
  42. #740 feat: add model alias create flags
  43. #739 feat: add tool-call assertion validator
  44. #738 feat: export markdown run transcripts
  45. #737 feat: filter streamed run events
  46. #736 feat: add cost-per-correct scorecard metric
  47. #735 fix: preserve Gemini thought signatures
  48. #734 feat: add race series aggregate reports
  49. #733 feat: add race series creation
  50. #731 feat: add challenge pack deployment lineups
  51. #730 feat: add seeded run creation
  52. #729 feat: add run max-iteration overrides
  53. #728 feat: export run events as JSONL
  54. #727 feat: surface scorecard total cost
  55. #726 feat: add workspace quota visibility
  56. #724 feat: add run cancellation
  57. #723 fix: log orphan reaper run ids
  58. #722 fix: reap orphaned queued runs
  59. #664 Add rich SEO schema and discovery guardrails
  60. #663 Enrich docs main entity schema
  61. #662 Enrich blog main entity schema
  62. #661 Add blog article schema image
  63. #660 Cover robots crawler policy
  64. #659 Cover sitemap discovery routes
  65. #658 Add docs social image alt metadata
  66. #657 Keep page Open Graph locale explicit
  67. #656 Add secondary page social metadata
  68. #655 Add blog index social metadata
  69. #654 Add platform social metadata
  70. #653 Add blog social image metadata
  71. #652 Add docs social metadata
  72. #651 Add docs structured data
  73. #650 Add blog post breadcrumb structured data
  74. #649 Add blog index structured data
  75. #648 Add RSS autodiscovery metadata
  76. #647 Add blog RSS feed
  77. #646 Include blog posts in llms discovery files
  78. #645 Add AI agent evaluation blog post
  79. #644 Add platform page structured data
  80. #643 Add platform pages to docs search index
  81. #642 List platform pages in llms discovery files
  82. #641 Add AI agent regression testing page
  83. #640 Add AI agent evaluation platform page
  84. #639 Improve npm discovery metadata
  85. #638 Clarify homepage AI agent evaluation copy
  86. #637 Noindex markdown docs exports
  87. #636 Add public page SEO metadata
  88. #635 Improve homepage SEO metadata signals
  89. #634 Remove hidden v2 marketing routes
  90. #630 CLI: expose full Agent Harness workflow
  91. #629 feat(cli): add workflow phase 1 commands
  92. #628 Agent Harnesses: classify failures and curate prior runs
  93. #627 Agent Harnesses: add suite rankings and pass@k
  94. #626 feat: add harness execution controls
  95. #625 feat: add agent harness task suites
  96. #624 feat: add harness privacy controls
  97. #623 fix: log invite accept denial reasons
  98. #622 feat: persist harness scorecards
  99. #621 feat: add agent harness bootstrap setup stage
  100. #620 fix: accept invite tokens when auth email is unavailable
  101. #617 fix: repair invite acceptance flow
  102. #616 feat(agent-harnesses): chat-first harness workbench
  103. #607 fix(cli): compact prompt eval results tables
  104. #605 fix(cli): polish prompt eval results table
  105. #603 docs: design prompt-eval failure promotion
  106. #602 feat(cli): add safe Promptfoo import subset
  107. #601 feat(action): add prompt-eval CI mode
  108. #600 feat(cli): add prompt-eval follow and results
  109. #599 feat(cli): compile and launch prompt-eval runs
  110. #598 feat(cli): add prompt-eval remote preflight
  111. #596 feat(cli): add prompt-eval config validation
  112. #580 fix: add copyable member invite links