Methodology
How we measure
Most leaderboards measure models in a vacuum: one prompt, one turn, price per token. We measure them doing real, multi-step work, and we show our uncertainty.
Models run as agents against live tools and repos. Completion is graded by outcomes, not preference.
Cost is actual provider spend per call, summed across the whole task including retries.
Each task runs many times. We report pass^k consistency (the chance that all k independent attempts succeed), not a lucky best-of-N.
Every number carries its sample size and a 95% confidence interval (CI). Hover over any leaderboard cell to see both.
Tasks are held out, and the grader runs outside the agent sandbox, so a model cannot read the answer.
The harness runs on Recursiv. You can run it against your own tasks.
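To make the pass^k and confidence-interval points above concrete, here is a minimal sketch. It assumes the commonly used unbiased pass^k estimator (the number of ways to pick k successes divided by the number of ways to pick k trials) and a Wilson score interval for the 95% CI. The page does not say which estimator or interval method the harness uses, so both choices, and the function names `pass_hat_k` and `wilson_interval`, are illustrative assumptions.

```python
from math import comb, sqrt


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k independent attempts succeed) for one task.

    n: trials run, c: trials that passed, k: attempts that must all pass.
    Assumed estimator: C(c, k) / C(n, k). Not confirmed as the harness's method.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when c < k, so too few passes gives 0.0
    return comb(c, k) / comb(n, k)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95%."""
    if n <= 0 or not (0 <= successes <= n):
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical task: 8 runs, 6 passes.
    print(pass_hat_k(8, 6, 1))   # plain pass rate
    print(pass_hat_k(8, 6, 4))   # much stricter: all 4 of 4 must pass
    print(wilson_interval(6, 8)) # wide interval, because n is small
```

With k = 1 the estimator reduces to the plain pass rate. Raising k penalizes models that pass only some of the time, which is the opposite of best-of-N, where a single lucky run counts as a pass.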
What we measure
Three numbers answer the only questions that matter for shipping agents: does it work, is it good, and what does it cost?
Share of real multi-step tasks finished end-to-end, reported as pass^k reliability.
Output correctness on completed tasks, graded by an independent judge model.
Real dollars to fully complete a verified task, retries and self-correction included. Not price-per-token.
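As a rough illustration of the cost metric above, the sketch below sums billed spend over every provider call in a run, retries included, and then divides total spend across all runs by the number of verified completions. That aggregation rule is an assumption: the page says only that retries and self-correction count, not how failed runs are charged. The `Call` and `Run` types and their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Call:
    usd: float              # billed provider spend for this call (hypothetical field)
    is_retry: bool = False  # retries are still billed, so they still count


@dataclass(frozen=True)
class Run:
    calls: Sequence[Call]
    verified: bool          # True if the external grader accepted the outcome


def run_cost(run: Run) -> float:
    """Total spend for one run, summed over every call including retries."""
    return sum(call.usd for call in run.calls)


def cost_to_done(runs: Sequence[Run]) -> Optional[float]:
    """Dollars spent per verified completion.

    Assumption: spend on failed runs is charged against the successes,
    since that money was still needed to get the work done.
    Returns None when no run was verified.
    """
    done = sum(1 for run in runs if run.verified)
    if done == 0:
        return None
    return sum(run_cost(run) for run in runs) / done


if __name__ == "__main__":
    runs = [
        Run(calls=(Call(0.10), Call(0.05, is_retry=True)), verified=True),
        Run(calls=(Call(0.20),), verified=False),
        Run(calls=(Call(0.15),), verified=True),
    ]
    print(cost_to_done(runs))
```

Under this rule a model with cheap tokens but a high failure rate can still end up with a high cost per completed task, which is the contrast with price-per-token that the text draws.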
Further agentic metrics (tool-use accuracy, self-correction, multi-agent coordination) are part of the broader program and roll out as experiments land.
The Recursiv Score
One 0–100 composite. Each metric is normalized across the field and weighted as below. Cost-to-Done and completion dominate, because finishing reliably and cheaply is the point.
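One way to picture the composite is sketched below. It assumes min-max normalization across the field, with cost inverted so that cheaper is better. The weights are placeholders chosen only so that completion and cost dominate; they are not the published v1 weights. The metric keys and function names are likewise hypothetical.

```python
from typing import Dict

# Placeholder weights for illustration only; NOT the published v1 weights.
PLACEHOLDER_WEIGHTS: Dict[str, float] = {
    "completion": 0.4,
    "cost_to_done": 0.4,
    "quality": 0.2,
}

# Metrics where a lower raw value is better.
LOWER_IS_BETTER = {"cost_to_done"}


def min_max(values: Dict[str, float], higher_is_better: bool = True) -> Dict[str, float]:
    """Scale raw values to [0, 1] across the field of models."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Every model ties on this metric, so give all of them full credit.
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if higher_is_better:
        return scaled
    return {model: 1.0 - s for model, s in scaled.items()}


def composite_scores(
    field: Dict[str, Dict[str, float]],
    weights: Dict[str, float] = PLACEHOLDER_WEIGHTS,
) -> Dict[str, float]:
    """field[metric][model] holds the raw value; returns a 0-100 score per model."""
    normalized = {
        metric: min_max(field[metric], higher_is_better=metric not in LOWER_IS_BETTER)
        for metric in weights
    }
    models = next(iter(field.values())).keys()
    return {
        model: 100 * sum(weights[m] * normalized[m][model] for m in weights)
        for model in models
    }


if __name__ == "__main__":
    field = {
        "completion": {"model_a": 0.9, "model_b": 0.5},
        "quality": {"model_a": 0.8, "model_b": 0.9},
        "cost_to_done": {"model_a": 2.0, "model_b": 1.0},
    }
    print(composite_scores(field))
```

Because normalization is relative to the field, a model's score in this sketch can move when other models are added or removed, even if its own raw numbers stay the same.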
v1. Weights and tasks evolve as we add experiments; changes are versioned with each dataset update.