Why we built this
We got tired of being lied to.
A model passed every eval we had. It failed in week one of production.
None of the benchmarks had touched our task.
The only eval you can trust is the one you ran yourself — your task, every model, at the same time.
AgentClash is that eval.
Pick your task the way your product actually runs it. Six models race, live, on the same inputs with the same tools. Scored on what matters in production — correctness, cost, latency, behaviour under pressure. When one fails, the failing trace becomes a test. Every mistake ratchets the eval tighter.
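That loop — race every model on the same inputs, then pin each failure as a permanent test — can be sketched in a few lines. Everything below is a hypothetical illustration, not AgentClash's actual API; the names `score`, `clash`, and the toy lambda "models" are all invented, and real scoring would also weigh cost and latency, not just correctness.

```python
# Minimal sketch of the "failing trace becomes a test" ratchet.
# All names are illustrative; this is not AgentClash's real interface.

def score(output: str, expected: str) -> bool:
    """Correctness only; a production scorer would also track cost and latency."""
    return output.strip() == expected.strip()

def clash(models: dict, case: str, expected: str, regression: list) -> dict:
    """Run every model on the same input; losing traces join the eval."""
    results = {name: fn(case) for name, fn in models.items()}
    for name, output in results.items():
        if not score(output, expected):
            # The failing trace is pinned as a permanent regression case,
            # so the eval only ever gets tighter.
            regression.append({"model": name, "input": case,
                               "got": output, "want": expected})
    return results

# Two toy "models" standing in for real agents on the same task.
models = {"a": lambda s: s.upper(), "b": lambda s: s}
regression = []
clash(models, "ok", "OK", regression)
# Model "b" missed, so its trace is now part of the eval.
```

Each run can only grow the regression set, which is the ratchet: a model has to clear every past failure before its score counts.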
Your task. Your models. Your scoreboard.