
Methodology

How we measure

Most leaderboards measure models in a vacuum: one prompt, one turn, price per token. We measure them doing real, multi-step work, and we show our uncertainty.

Real tasks, real tools

Models run as agents against live tools and repos. Completion is graded by outcomes, not preference.

Real dollars

Cost is actual provider spend per call, summed across the whole task including retries.
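As an illustration, the per-task cost described above could be tallied along these lines. This is a minimal sketch, not the harness's actual code: the `CallRecord` shape, its field names, and the task IDs are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class CallRecord:
    """One billed provider call (hypothetical record shape)."""
    task_id: str
    attempt: int        # 0 for the first try, 1+ for retries
    cost_usd: Decimal   # amount the provider billed for this call


def task_cost(calls: list[CallRecord], task_id: str) -> Decimal:
    """Sum billed spend over every call made for a task, retries included."""
    return sum(
        (c.cost_usd for c in calls if c.task_id == task_id),
        Decimal("0"),
    )


# Illustrative data only, not measured results.
calls = [
    CallRecord("fix-bug-17", 0, Decimal("0.042")),
    CallRecord("fix-bug-17", 0, Decimal("0.013")),
    CallRecord("fix-bug-17", 1, Decimal("0.051")),  # a retry still counts
    CallRecord("other-task", 0, Decimal("0.020")),
]
print(task_cost(calls, "fix-bug-17"))  # 0.106
```

`Decimal` is used here so that many small per-call charges add up without binary floating-point drift.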

Reliability over peak

Each task runs many times. We report pass^k, the chance that all k independent attempts succeed, not a lucky best-of-N.
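To make the contrast concrete, here is a minimal sketch of both statistics. It assumes pass^k is read as "all k independent attempts succeed" and estimated from n recorded runs with c successes as C(c, k) / C(n, k); the function names are illustrative, not taken from the harness.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent attempts ALL succeed.

    Computes C(c, k) / C(n, k): the fraction of k-sized subsets of the
    n recorded runs that consist entirely of successes.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """Best-of-k for comparison: probability that AT LEAST ONE of k succeeds."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return 1.0 - comb(trials - successes, k) / comb(trials, k)


# Illustrative numbers: a task that passed 6 of 8 runs, with k = 4.
print(round(pass_hat_k(6, 8, 4), 3))  # 0.214
print(pass_at_k(6, 8, 4))             # 1.0
```

The same run history looks perfect under best-of-4 and succeeds four times in a row only about one time in five, which is the gap a consistency metric is meant to expose. A benchmark-level figure would typically average the per-task value across tasks.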

Show the uncertainty

Every number carries its sample size and a 95% CI. Hover any leaderboard cell.
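For a pass rate, one common way to produce such an interval is the Wilson score interval. The sketch below is an assumption about method, since the page does not say which interval is used; `wilson_interval` is an illustrative name.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative only: 45 passes out of 50 runs.
low, high = wilson_interval(45, 50)
print(f"n=50, pass rate=0.90, 95% CI=({low:.3f}, {high:.3f})")
```

Unlike the simpler normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small samples and for rates near 0 or 1, which matters when a task is run only a few dozen times.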

Contamination resistant

Tasks are held out, and the grader runs outside the agent sandbox, so a model cannot read the answer.

Reproducible

The harness runs on Recursiv. You can run it against your own tasks.

What we measure

Three numbers answer the only question that matters for shipping agents: does it work, is it good, what does it cost?

Task completion

Share of real multi-step tasks finished end-to-end, reported as pass^k reliability.

Quality

Output correctness on completed tasks, graded by an independent judge model.

Cost-to-Done

Real dollars to fully complete a verified task, retries and self-correction included. Not price-per-token.

Further agentic metrics (tool-use accuracy, self-correction, multi-agent coordination) are part of the broader program and roll out as experiments land.

The Recursiv Score

One 0–100 composite. Each metric is normalized across the field and weighted as below. Task completion and Cost-to-Done dominate, together making up 75% of the score, because finishing reliably and cheaply is the point.

Task completion: 40%
Cost-to-Done: 35%
Quality: 25%

v1. Weights and tasks evolve as we add experiments; changes are versioned with each dataset update.
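The weighting can be sketched as follows. The weights come from the list above; everything else is an assumption made for illustration. In particular, the page does not say how metrics are normalized, so this sketch uses min-max scaling across the field and inverts Cost-to-Done so that cheaper is better. The model names and metric values are made up.

```python
# Weights as stated for v1 of the score.
WEIGHTS = {"completion": 0.40, "cost_to_done": 0.35, "quality": 0.25}
# Assumption: cost is the only metric where a lower raw value is better.
LOWER_IS_BETTER = {"cost_to_done"}


def min_max(values: dict[str, float], invert: bool = False) -> dict[str, float]:
    """Scale values to [0, 1] across the field (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Every model ties on this metric, so give each full credit.
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if invert:
        return {model: 1.0 - s for model, s in scaled.items()}
    return scaled


def composite_scores(field: dict[str, dict[str, float]]) -> dict[str, float]:
    """Return a 0-100 weighted composite for every model in the field."""
    normalized = {
        metric: min_max(
            {model: row[metric] for model, row in field.items()},
            invert=metric in LOWER_IS_BETTER,
        )
        for metric in WEIGHTS
    }
    return {
        model: 100.0 * sum(WEIGHTS[m] * normalized[m][model] for m in WEIGHTS)
        for model in field
    }


# Hypothetical two-model field; values are illustrative, not results.
field = {
    "model-a": {"completion": 0.80, "cost_to_done": 1.20, "quality": 0.90},
    "model-b": {"completion": 0.60, "cost_to_done": 0.40, "quality": 0.70},
}
for model, score in composite_scores(field).items():
    print(model, round(score, 1))
```

With a two-model field, min-max scaling pushes every metric to 0 or 1, so model-a keeps the completion and quality weights (65) and model-b keeps only the cost weight (35). A larger field gives intermediate values, and a score under this scheme shifts whenever the field changes.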

Run it yourself

Stop guessing which model to ship.

Every number here was produced by running real agentic work on Recursiv. Book a demo and we will show you the platform these experiments run on.