New Results: Leaderboard updated with Mini-SWE Agent, Software Agent SDK frameworks, and Git access experiments!

Leaderboard

Model performance on 652 collaborative coding tasks


Methodology

Cooperative: Two agents work on separate features that may conflict. Success requires both features to pass tests after merging.

Solo: A single agent implements both features. This serves as the upper bound for what agents can achieve.

All models evaluated on the same 652 tasks with identical prompts and tool access.
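The cooperative pass criterion described above (a task counts as solved only if both agents' features still pass their tests after the merge) can be sketched as follows. This is an illustrative sketch, not the benchmark's actual harness; the `MergeOutcome` structure and function name are assumptions.

```python
from dataclasses import dataclass

@dataclass
class MergeOutcome:
    merged_cleanly: bool    # did the merge complete without unresolved conflicts?
    feature_a_passes: bool  # does feature A's test suite pass on the merged code?
    feature_b_passes: bool  # does feature B's test suite pass on the merged code?

def cooperative_success(outcome: MergeOutcome) -> bool:
    """A cooperative task is solved only when the merge succeeds AND
    both features' tests pass on the merged result."""
    return (outcome.merged_cleanly
            and outcome.feature_a_passes
            and outcome.feature_b_passes)
```

For example, a clean merge where only one feature's tests pass still counts as a failure: `cooperative_success(MergeOutcome(True, True, False))` is `False`.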

Detailed Results

Model                           Solo    Coop    Gap
Gemini 3 Flash (OpenHands SDK)  48.6%   26.2%   -22.4%
GPT-5 (OpenHands)               48.31%  27.95%  -20.36%
Claude Sonnet 4.5 (OpenHands)   47.1%   25.9%   -21.2%
Gemini 3 Pro (Mini-SWE)         36.8%   20.4%   -16.4%
MiniMax M2 (OpenHands)          36.2%   14.0%   -22.2%
Gemini 3 Flash (Mini-SWE)       25.2%   12.3%   -12.9%
Qwen3-Coder-30B (OpenHands)     21.6%   13.3%   -8.3%
Qwen3-30B (OpenHands)            6.3%    4.6%   -1.7%
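The Gap column is simply the cooperative score minus the solo score, in percentage points; every model loses accuracy when it must coordinate with a second agent. A quick consistency check over the rows above (table values copied verbatim):

```python
# Sanity-check the Gap column: gap = coop - solo, in percentage points.
rows = [
    ("Gemini 3 Flash (OpenHands SDK)", 48.6,  26.2),
    ("GPT-5 (OpenHands)",              48.31, 27.95),
    ("Claude Sonnet 4.5 (OpenHands)",  47.1,  25.9),
    ("Gemini 3 Pro (Mini-SWE)",        36.8,  20.4),
    ("MiniMax M2 (OpenHands)",         36.2,  14.0),
    ("Gemini 3 Flash (Mini-SWE)",      25.2,  12.3),
    ("Qwen3-Coder-30B (OpenHands)",    21.6,  13.3),
    ("Qwen3-30B (OpenHands)",           6.3,   4.6),
]

# Rounding to 2 decimals avoids floating-point noise in the subtraction.
gaps = {name: round(coop - solo, 2) for name, solo, coop in rows}
for name, gap in gaps.items():
    print(f"{name:34s} {gap:+.2f} pts")
```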
