The real cost of finishing the job
Price-per-token is a lie for agents. We ran ten frontier models on the same real tasks and measured the dollars it actually took to finish. The models that were cheapest per token were often the most expensive to finish with.
Figure: Cost-to-Done vs completion. The efficiency frontier is up and to the left. Bars are 95% confidence intervals on cost.
Real cost to finish, by model
The best-value model was not the cheapest per token. It was the one that finished reliably without burning retries. Rank by Cost-to-Done, not sticker price, or you ship the most expensive option by accident.
Each model ran the same real, multi-step software tasks on Recursiv: read a repo, make a change, run the tests, and fix its own failures until the suite passed. For each run we recorded whether it finished, the real dollars spent including every retry, and the full transcript.
The models with the lowest sticker price were not the cheapest to finish. The smallest model looked 30x cheaper per token, yet its Cost-to-Done was the highest in the field: it failed, retried, and looped until its total spend exceeded that of every frontier model.
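To make the arithmetic concrete, here is a minimal sketch of how a per-token bargain can turn into the highest cost to finish. It assumes Cost-to-Done is total spend across all runs (retries included) divided by the number of runs that finished; that definition, the `Run` record, the `make_runs` helper, and every number below are illustrative assumptions, not the experiment's method or results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    finished: bool   # did the test suite pass at the end of the run?
    dollars: float   # total spend for the run, every retry included


def cost_to_done(runs: list[Run]) -> float:
    """Assumed definition: total spend over all runs / number of finished runs."""
    total = sum(r.dollars for r in runs)
    done = sum(1 for r in runs if r.finished)
    return float("inf") if done == 0 else total / done


def make_runs(model, price_per_mtok, mtok_per_run, n_runs, n_finished):
    """Build illustrative runs; the first n_finished are marked as finished."""
    return [
        Run(model, i < n_finished, price_per_mtok * mtok_per_run)
        for i in range(n_runs)
    ]


# Made-up numbers, NOT results from the experiment.
# The small model is 30x cheaper per million tokens but loops and burns tokens.
small = make_runs("small", price_per_mtok=0.5, mtok_per_run=4.0,
                  n_runs=10, n_finished=2)
frontier = make_runs("frontier", price_per_mtok=15.0, mtok_per_run=0.2,
                     n_runs=10, n_finished=9)

print(f"small:    ${cost_to_done(small):.2f} per finished task")
print(f"frontier: ${cost_to_done(frontier):.2f} per finished task")
```

With these invented inputs the small model costs about three times as much per finished task, even though its sticker price is 30x lower. Failed runs still cost money, and the spend gets divided over fewer completions.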
This experiment runs on Recursiv. Book a demo and we will run it against your own tasks and your own definition of done.
The receipts
Not a synthetic benchmark. An actual agent run recorded on Recursiv: every tool call, every retry, and the real dollars it cost.
Every model ran the same held-out task suite, N times, on Recursiv. Cost is real provider spend including retries. Numbers carry 95% confidence intervals.
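One common way to put a 95% interval on cost from N repeated runs is a percentile bootstrap. The sketch below assumes that approach; the page does not say which interval method was actually used, and the function name, resample count, and sample costs are hypothetical.

```python
import random
import statistics


def bootstrap_ci(costs, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `costs`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the runs with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(costs, k=len(costs)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-run dollar costs for one model, NOT measured data.
costs = [1.2, 0.9, 3.4, 1.1, 1.0, 2.8, 1.3, 0.95, 1.05, 4.1]
low, high = bootstrap_ci(costs)
print(f"mean ${statistics.fmean(costs):.2f}, 95% CI [${low:.2f}, ${high:.2f}]")
```

A bootstrap suits this kind of data because agent costs tend to be skewed: a few runs that loop can cost several times the typical run, which a normal-approximation interval handles poorly at small N.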