Recursiv Research

Recursiv//Research · Power Rankings

Autonomous agents run the top AI models on real tasks around the clock, then rank them by what actually ships: how reliably they finish, and what it really costs.

LIVE · v2026.06.05 · n=80 runs · 10 models · 4 use-cases · held-out · judge-graded · updated 1mo ago

#	Model	Vendor	Reliability	Cost-to-Done	Quality
●1	Gemini 3.5 Flash	Google	100%	<$0.0001	97
●2	GPT-4o mini	OpenAI	100%	<$0.0001	90
●3	DeepSeek V4 Pro	DeepSeek	100%	<$0.0001	97
4	MiniMax M3	MiniMax	100%	$0.0001	96
5	Kimi K2.6	Moonshot	88%	<$0.0001	86
6	Gemini 3.1 Pro	Google	100%	$0.0006	97
7	Grok 4.3	xAI	100%	$0.0007	96
8	Claude Sonnet 4.6	Anthropic	88%	$0.0008	95
9	GPT-5.5	OpenAI	88%	$0.0009	93
10	Claude Opus 4.8	Anthropic	75%	$0.0033	86

Ranked by value (reliability per dollar).

Best value · most reliability per dollar

Gemini 3.5 Flash

finishes 100% of tasks at <$0.0001 each

Most reliable · finishes the most tasks

Gemini 3.5 Flash

100% of tasks finished · <$0.0001/task

Cheapest that works · lowest cost above 80% reliable

Gemini 3.5 Flash

<$0.0001 per task · 100% finished
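
The three picks above follow simple selection rules, which can be sketched roughly as below. This is an illustration only, not Recursiv's actual code: the `Row` type, the function names, the tie-break, and the sample figures are all hypothetical, and "value" is assumed to mean reliability divided by Cost-to-Done.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    reliability: float  # fraction of tasks finished, 0..1
    cost_usd: float     # Cost-to-Done in dollars, must be > 0


def best_value(rows: list[Row]) -> Row:
    # Most reliability per dollar (assumed definition of "value").
    return max(rows, key=lambda r: r.reliability / r.cost_usd)


def most_reliable(rows: list[Row]) -> Row:
    # Highest reliability; ties go to the cheaper model (assumed tie-break).
    return max(rows, key=lambda r: (r.reliability, -r.cost_usd))


def cheapest_that_works(rows: list[Row], floor: float = 0.80) -> Optional[Row]:
    # Lowest cost among models whose reliability is above the floor.
    eligible = [r for r in rows if r.reliability > floor]
    return min(eligible, key=lambda r: r.cost_usd) if eligible else None


# Hypothetical rows, not taken from the leaderboard.
rows = [
    Row("model-a", 1.00, 0.0002),
    Row("model-b", 0.88, 0.0001),
    Row("model-c", 0.75, 0.00005),
]
print(best_value(rows).model)           # model-c
print(most_reliable(rows).model)        # model-a
print(cheapest_that_works(rows).model)  # model-b
```

In this made-up sample the three rules pick three different models, which is why one model sweeping all three cards on the leaderboard is notable.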

Reliability = share of tasks finished across repeated runs. Cost-to-Done = real $ to finish one task.
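
The two definitions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published method: the `Run` record and its field names are hypothetical, the sample numbers are made up, and counting the spend of failed runs toward Cost-to-Done is one plausible reading of "real $ to finish one task".

```python
from dataclasses import dataclass


@dataclass
class Run:
    finished: bool   # did the agent complete the task?
    cost_usd: float  # dollars spent on this run, finished or not


def reliability(runs: list[Run]) -> float:
    """Share of runs that finished."""
    return sum(r.finished for r in runs) / len(runs)


def cost_to_done(runs: list[Run]) -> float:
    """Total spend divided by finished runs (assumed definition)."""
    done = sum(r.finished for r in runs)
    if done == 0:
        return float("inf")  # nothing finished, so no finite cost per finish
    return sum(r.cost_usd for r in runs) / done


# Hypothetical data: 8 runs at $0.0007 each, 7 of which finished.
runs = [Run(True, 0.0007)] * 7 + [Run(False, 0.0007)]
print(f"{reliability(runs):.1%}")     # 87.5%
print(f"${cost_to_done(runs):.4f}")   # $0.0008
```

If the 80 runs are split evenly across the 10 models (an assumption; the split is not stated), each model gets 8 runs, so 7 of 8 finished is 87.5%, consistent with the 88% entries in the table, and 6 of 8 is 75%.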


Experiments

Every number above comes from one of these runs.

Experiment 001 · Jun 4, 2026

The real cost of finishing the job

We gave the top 10 AI models the same real coding, data, reasoning, and SQL tasks, then measured what each one finished, how good the output was, and what it cost.

101× · priciest, and last place


Run it yourself

Every number here came from running real agentic work on Recursiv. Point the same swarm at your own tasks.
