New Results: Leaderboard updated with Mini-SWE Agent, Software Agent SDK frameworks, and Git access experiments!
Leaderboard
Model performance on 652 collaborative coding tasks
The interactive leaderboard ranks each model by framework, Git access, and cooperative success rate (with 95% confidence intervals), alongside an overall score.
Methodology
Cooperative: Two agents work on separate features that may conflict. Success requires both features to pass tests after merging.
Solo: A single agent implements both features on its own. This serves as an upper bound on what a cooperating pair could achieve.
All models evaluated on the same 652 tasks with identical prompts and tool access.
Detailed Results
| Model | Solo | Coop | Gap |
|---|---|---|---|
| Gemini 3 Flash (OpenHands SDK) | 48.6% | 26.2% | -22.4% |
| GPT-5 (OpenHands) | 48.31% | 27.95% | -20.36% |
| Claude Sonnet 4.5 (OpenHands) | 47.1% | 25.9% | -21.2% |
| Gemini 3 Pro (Mini-SWE) | 36.8% | 20.4% | -16.4% |
| MiniMax M2 (OpenHands) | 36.2% | 14.0% | -22.2% |
| Gemini 3 Flash (Mini-SWE) | 25.2% | 12.3% | -12.9% |
| Qwen3-Coder-30B (OpenHands) | 21.6% | 13.3% | -8.3% |
| Qwen3-30B (OpenHands) | 6.3% | 4.6% | -1.7% |
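The Gap column is simply the cooperative rate minus the solo rate, in percentage points. A quick check of the table's arithmetic (values copied from the rows above):

```python
# (solo, coop) success rates in percent, from the Detailed Results table.
results = {
    "Gemini 3 Flash (OpenHands SDK)": (48.6, 26.2),
    "GPT-5 (OpenHands)": (48.31, 27.95),
    "Claude Sonnet 4.5 (OpenHands)": (47.1, 25.9),
    "Gemini 3 Pro (Mini-SWE)": (36.8, 20.4),
    "MiniMax M2 (OpenHands)": (36.2, 14.0),
    "Gemini 3 Flash (Mini-SWE)": (25.2, 12.3),
    "Qwen3-Coder-30B (OpenHands)": (21.6, 13.3),
    "Qwen3-30B (OpenHands)": (6.3, 4.6),
}

for model, (solo, coop) in results.items():
    gap = round(coop - solo, 2)  # negative: coordination costs success rate
    print(f"{model}: gap = {gap:+} pp")
```

Every gap is negative: no model cooperates as well as it works alone, and the strongest solo models lose the most in absolute terms.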