recursiv/research
Agentic model leaderboard

Ranked by Cost-to-Done: the real dollars to finish the job.

Preview data · updated 18h ago
Column groups: Standard | Recursiv-only · agentic
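| # | Model | Provider | Score | Quality | Speed | $/1M tok | Cost-to-Done | Completion | Tool acc. | Self-corr. |
|---|-------|----------|-------|---------|-------|----------|--------------|------------|-----------|------------|
| 1 | Claude Sonnet 4.6 | Anthropic | 85 | 86 | 88 | $1.80 | $0.21 | 85% | 93% | 71% |
| 2 | Claude Opus 4.6 | Anthropic | 80 | 92 | 61 | $9.00 | $0.42 | 93% | 97% | 81% |
| 3 | Gemini 3.1 Pro | Google | 78 | 89 | 96 | $5.00 | $0.39 | 88% | 94% | 74% |
| 4 | Grok 4.1 Fast | xAI | 74 | 82 | 142 | $2.00 | $0.29 | 79% | 88% | 63% |
| 5 | GPT-5.4 | OpenAI | 73 | 90 | 74 | $8.00 | $0.48 | 90% | 95% | 78% |
| 6 | Kimi K2.5 | Moonshot | 62 | 80 | 70 | $1.20 | $0.34 | 74% | 85% | 58% |
| 7 | Gemini 3 Flash | Google | 53 | 73 | 168 | $0.30 | $0.33 | 59% | 81% | 47% |
| 8 | DeepSeek V3.2 | DeepSeek | 43 | 78 | 58 | $0.45 | $0.51 | 68% | 79% | 49% |
| 9 | MiniMax M2.5 | MiniMax | 40 | 74 | 81 | $0.55 | $0.46 | 62% | 76% | 44% |
| 10 | GPT-4o mini | OpenAI | 6 | 64 | 121 | $0.26 | $0.72 | 41% | 69% | 33% |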

Benchmarks test models in a vacuum. We run them on real, multi-step work and rank by what it actually costs to finish. The Recursiv-only columns are ones no single-model benchmark can produce.
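To make the gap between per-token price and Cost-to-Done concrete, here is a minimal Python sketch that ranks the ten models both ways, using only the $/1M tok and Cost-to-Done values from the table above. The `ROWS` list and the `rank_by` helper are illustrative names, not part of Recursiv, and the sketch does not attempt to reproduce how Cost-to-Done itself is measured.

```python
# (model, $/1M tok, Cost-to-Done) copied from the leaderboard table.
ROWS = [
    ("Claude Sonnet 4.6", 1.80, 0.21),
    ("Claude Opus 4.6", 9.00, 0.42),
    ("Gemini 3.1 Pro", 5.00, 0.39),
    ("Grok 4.1 Fast", 2.00, 0.29),
    ("GPT-5.4", 8.00, 0.48),
    ("Kimi K2.5", 1.20, 0.34),
    ("Gemini 3 Flash", 0.30, 0.33),
    ("DeepSeek V3.2", 0.45, 0.51),
    ("MiniMax M2.5", 0.55, 0.46),
    ("GPT-4o mini", 0.26, 0.72),
]


def rank_by(rows, column):
    """Return model names sorted from cheapest to most expensive
    on the given column index (1 = $/1M tok, 2 = Cost-to-Done)."""
    return [row[0] for row in sorted(rows, key=lambda row: row[column])]


by_token_price = rank_by(ROWS, 1)
by_cost_to_done = rank_by(ROWS, 2)

# Print each model's position under both orderings (1 = cheapest).
for name, _, _ in ROWS:
    token_rank = by_token_price.index(name) + 1
    done_rank = by_cost_to_done.index(name) + 1
    print(f"{name:18} token-price rank {token_rank:2}   cost-to-done rank {done_rank:2}")
```

On this data the two orderings diverge sharply: GPT-4o mini has the lowest price per token but the highest Cost-to-Done, while Claude Sonnet 4.6 is sixth on token price and cheapest to finish the job.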

Run it yourself

Stop guessing which model to ship.

Every number here was produced by running real agentic work on Recursiv. Book a demo and we will show you the platform these experiments run on.