Blog

AI agent evaluation, in practice

Engineering notes on same-task agent evals, replayable failures, scorecards, and release gates — for teams deciding which agents to ship.

2026-06-15 · Atharva
Evaluating Bilingual Customer Support Agents: Arabic, English, and Release Evidence
Bilingual support agents need evals on real ticket flows in each language, not a single translated prompt. How to build Arabic and English cases with replay and gates.

2026-06-14 · Atharva
AI Agent Governance for Middle East Enterprises: Residency, Evidence, and Release Gates
UAE and GCC buyers need clear separation between data residency, sovereign cloud, and agent release evidence. A governance checklist grounded in repeatable eval workflows.

2026-06-13 · Atharva
Building an Agent Eval Program in a Regulated Enterprise
Regulated enterprises need versioned workloads, audit-friendly replay, baseline gates, and CI handoff. A phased plan to stand up a governed agent eval program.

2026-06-13 · Atharva
When the US Government Banned Fable 5
Three days after launch, the US Commerce Department ordered Anthropic to pull its best model. A model's availability is now a jurisdictional variable, and that changes how everyone should build.

2026-06-12 · Atharva
Why Your AI Pilot Failed (and How Eval Fixes the Second Attempt)
Most AI agent pilots fail on workload mismatch, missing baselines, and unreviewable evidence. How to redesign the second attempt with governed benchmarks and replay.

2026-06-11 · Atharva
Agent Evaluation vs Prompt Evaluation: When Braintrust Isn't Enough
Prompt evaluation tools like Braintrust excel at scoring model outputs. Production agents need trajectory evals on real tasks: here's how to know when to add sandboxed agent testing.

2026-06-11 · Atharva
The AI Platform Lead's Guide to Agent Release Gates
Release gates for AI agents need frozen workloads, baseline comparison, replay evidence, and CI handoff. A guide for platform leads shipping governed agent systems.

2026-06-11 · AgentClash
Coding agent benchmark — June 2026
Our first measured public benchmark: four GPT generations on a real coding task with frozen challenge packs, full trajectory scoring, and replay evidence. Methodology, scoreboard, and reproduction steps.

2026-06-11 · AgentClash
AgentClash product updates — June 2026
Monthly rollup of AgentClash product updates from May and June 2026: datasets, security packs, skills distribution, and the first public coding-agent benchmark. Full timeline on the changelog.

2026-06-10 · Atharva
How to Get AI Agent Approval from Security and Compliance
Security and compliance teams need replay evidence, frozen workloads, and gate verdicts before they approve an AI agent. A practical approval checklist for platform leads.

2026-06-10 · Atharva
How to Benchmark AI Agents on Your Own Data (Not Public Leaderboards)
Public leaderboards answer a general model question. Shipping agents requires benchmarks on your workflows, tools, and failure modes: a practical playbook.

2026-06-09 · Atharva
pass@k, pass^k, and Reliability: What Enterprise Teams Should Measure
Enterprise agent releases need explicit reliability metrics. A practical guide to pass@k and pass^k for platform teams, extending the pass@k vs pass^k primer with release-gate examples.

2026-06-08 · Atharva
Evaluating Coding Agents on Private Repos: A Practical Checklist
Coding agents on private repositories need evals that respect your layout, tests, and policies (not public leaderboard tasks). A checklist for platform and developer-experience teams.

2026-06-07 · Atharva
AgentClash vs LangSmith vs Braintrust for Production Agent Testing
LangSmith traces chains, Braintrust scores prompts, AgentClash races tool-using agents on real tasks with replay and CI gates: an honest workflow comparison for production agent testing.

2026-06-07 · Atharva
I Tried to Fingerprint How AI Agents Cheat — and the Brand Didn't Matter
A small pilot on AgentClash: 8 frontier models, 6 kinds of test-gaming, 192 runs. The surprise wasn't which lab built the model; it was that how an agent cheats tracks its capability, not its provider. And one deleted comment made the cheating disappear.

2026-06-06 · Atharva
pass@k vs pass^k: What Agent Reliability Metrics Actually Measure
A practical guide to pass@k and pass^k for agent evaluation: when independent retries and strict success-over-trials measure different kinds of reliability.

2026-06-03 · Atharva
Why AgentClash Races Agents Head-to-Head
Staggered, one-at-a-time evals compare runs that never faced the same conditions. AgentClash runs every model on the same task at the same time on the same budget. Here is why concurrent racing produces a fairer verdict.

2026-06-02 · Atharva
How AgentClash Scores Agent Trajectories
Final-answer grading misses how an agent got there. Here is how AgentClash scores the whole trajectory with deterministic checks, numeric metrics, behavioral signals, and LLM judges, then aggregates them into one defensible verdict.

2026-05-07 · Atharva
AI Agent Evaluation Needs Regression Testing, Not Just Benchmarks
A practical guide to AI agent evaluation with real-task workloads, replay evidence, scorecards, challenge packs, and CI regression gates.

2026-03-23 · Atharva
Why We Built AgentClash
Static benchmarks leak. Leaderboards reward hype. We built something different.