recursiv/research
live · June 2, 2026

The real cost of finishing the job

Price-per-token is a lie for agents. We ran ten frontier models on the same real tasks and measured the dollars it actually took to finish. The models that looked cheapest per token were often the most expensive to finish.

$0.21 · cheapest cost-to-done
3.4x · cheapest vs priciest
10 · models tested
30 · runs per model

Cost-to-Done vs completion

Efficiency frontier is up and to the left. Bars are 95% CI on cost.

[Figure: scatter plot of Cost-to-Done ($ / task, log scale, $0.20 to $1.00) against task completion (pass^3, 0% to 100%). Labeled points: Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3.1 Pro. Legend: on the efficiency frontier, dominated. Horizontal bars are the 95% CI on cost.]
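To make "on the efficiency frontier" versus "dominated" concrete, here is a minimal sketch of how such a frontier could be computed. It assumes the usual Pareto definition (a model is dominated if another model is at least as cheap and at least as complete, and strictly better on one of the two). The `Point` type, the function names, and the four data points are hypothetical placeholders, not results from this study.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float        # Cost-to-Done in dollars per task (lower is better)
    completion: float  # task completion rate in [0, 1] (higher is better)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.completion >= b.completion
    strictly_better = a.cost < b.cost or a.completion > b.completion
    return no_worse and strictly_better


def efficiency_frontier(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical placeholder points, NOT measurements from the experiment.
points = [
    Point("model_a", 0.20, 0.80),
    Point("model_b", 0.30, 0.90),
    Point("model_c", 0.35, 0.85),  # dominated by model_b
    Point("model_d", 0.50, 0.60),  # dominated by model_a
]

frontier = efficiency_frontier(points)
print([p.name for p in frontier])  # ['model_a', 'model_b']
```

The pairwise check is quadratic in the number of models, which is fine for a field of ten.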

Real cost to finish, by model

Model               Cost-to-Done   vs cheapest
Claude Sonnet 4.6   $0.21          1.0x
Grok 4.1 Fast       $0.29          1.4x
Gemini 3 Flash      $0.33          1.6x
Kimi K2.5           $0.34          1.6x
Gemini 3.1 Pro      $0.39          1.9x
Claude Opus 4.6     $0.42          2.0x
MiniMax M2.5        $0.46          2.2x
GPT-5.4             $0.48          2.3x
DeepSeek V3.2       $0.51          2.4x
GPT-4o mini         $0.72          3.4x

Verdict

The best-value model was not the cheapest per token. It was the one that finished reliably without burning retries. Rank by Cost-to-Done, not sticker price, or you ship the most expensive option by accident.
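As a small worked example of ranking by Cost-to-Done, the sketch below takes the per-model figures published on this page, sorts them, and recomputes each model's multiple of the cheapest. The variable names are illustrative. Only the dollar values come from the leaderboard.

```python
# Cost-to-Done in dollars per task, as published in the leaderboard on this page.
cost_to_done = {
    "GPT-4o mini": 0.72,
    "Claude Opus 4.6": 0.42,
    "Gemini 3 Flash": 0.33,
    "Claude Sonnet 4.6": 0.21,
    "DeepSeek V3.2": 0.51,
    "Grok 4.1 Fast": 0.29,
    "GPT-5.4": 0.48,
    "Kimi K2.5": 0.34,
    "MiniMax M2.5": 0.46,
    "Gemini 3.1 Pro": 0.39,
}

# Rank from cheapest to priciest by what it cost to finish a task.
ranked = sorted(cost_to_done.items(), key=lambda kv: kv[1])
baseline = ranked[0][1]  # cheapest Cost-to-Done in the field

for name, cost in ranked:
    print(f"{name:<18} ${cost:.2f}  {cost / baseline:.1f}x")
```

The last row comes out at 3.4x, matching the headline spread between the cheapest and priciest model.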

Each model ran the same real, multi-step software tasks on Recursiv: read a repo, make a change, run the tests, fix its own failures until the suite passed. We recorded whether it finished, the real dollars spent including every retry, and the full transcript.
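The page does not spell out the exact formula for Cost-to-Done, so the following is only a sketch under one plausible assumption: total provider spend across every attempt, failed ones and retries included, divided by the number of distinct tasks that finished. The `Attempt` record, its fields, and the sample log are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    task_id: str
    cost_usd: float  # provider spend for this single attempt
    passed: bool     # did the test suite pass at the end of this attempt?


def cost_to_done(attempts: list[Attempt]) -> float:
    """Assumed definition: all spend (retries included) per finished task."""
    total_spend = sum(a.cost_usd for a in attempts)
    finished = {a.task_id for a in attempts if a.passed}
    if not finished:
        return float("inf")  # nothing finished, so no finite cost-to-done
    return total_spend / len(finished)


# Hypothetical log, NOT data from the experiment.
log = [
    Attempt("t1", 0.10, True),   # finished first try
    Attempt("t2", 0.05, False),  # failed
    Attempt("t2", 0.05, False),  # retry, failed again
    Attempt("t2", 0.10, True),   # retry, finished
    Attempt("t3", 0.20, False),  # never finished, still costs money
]

print(f"${cost_to_done(log):.2f} per finished task")  # $0.25 per finished task
```

Under this definition, spend on tasks that never finish still counts against the model, which is why a cheap model that loops can end up with a high figure.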

The models with the lowest sticker price were not the cheapest to finish. The smallest model looked 30x cheaper per token, yet its Cost-to-Done was the highest in the field, at $0.72 per task: it failed, retried, and looped until its total spend passed that of every frontier model.

Reproduce it

This experiment runs on Recursiv. Book a demo and we will run it against your own tasks and your own definition of done.

The receipts

Not a synthetic benchmark. An actual agent run recorded on Recursiv: every tool call, every retry, and the real dollars it cost.

Methodology

Every model ran the same held-out task suite, N times, on Recursiv. Cost is real provider spend including retries. Numbers carry 95% confidence intervals.
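The page does not say how the 95% confidence intervals were computed. One common choice for a small number of runs is a percentile bootstrap on the mean cost, sketched below with the standard library only. The function name, its parameters, and the 30 per-run costs are hypothetical and are not figures from the experiment.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# 30 hypothetical per-run costs in dollars, NOT data from the experiment.
costs = [0.18, 0.22, 0.19, 0.25, 0.21, 0.20, 0.31, 0.17, 0.23, 0.20] * 3

lo, hi = bootstrap_ci(costs)
print(f"mean ${statistics.fmean(costs):.3f}, 95% CI [${lo:.3f}, ${hi:.3f}]")
```

A bootstrap avoids assuming the per-run costs are normally distributed, which matters when a few retry-heavy runs produce a long right tail.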

Run it yourself

Stop guessing which model to ship.

Every number here was produced by running real agentic work on Recursiv. Book a demo and we will show you the platform these experiments run on.