New Results: Leaderboard updated with Mini-SWE Agent, Software Agent SDK frameworks, and Git access experiments!
Leaderboard
Model performance on 652 collaborative coding tasks
The interactive leaderboard ranks each model by framework, Git access, and cooperative success rate (with 95% confidence intervals), alongside an overall score.
Methodology
Cooperative: Two agents work on separate features that may conflict. Success requires both features to pass tests after merging.
Solo: A single agent implements both features on its own. This serves as an upper bound on what a cooperating pair could achieve.
All models evaluated on the same 652 tasks with identical prompts and tool access.
Detailed Results
| Model | Solo | Coop | Gap |
|---|---|---|---|
| Gemini 3 Flash (OpenHands SDK) | 48.6% | 26.2% | -22.4% |
| GPT-5 (OpenHands) | 48.31% | 27.95% | -20.36% |
| Claude Sonnet 4.5 (OpenHands) | 47.1% | 25.9% | -21.2% |
| Gemini 3 Pro (Mini-SWE) | 36.8% | 20.4% | -16.4% |
| MiniMax M2 (OpenHands) | 36.2% | 14.0% | -22.2% |
| Gemini 3 Flash (Mini-SWE) | 25.2% | 12.3% | -12.9% |
| Qwen3-Coder-30B (OpenHands) | 21.6% | 13.3% | -8.3% |
| Qwen3-30B (OpenHands) | 6.3% | 4.6% | -1.7% |
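The Gap column is simply the cooperative rate minus the solo rate, in percentage points. A quick check of the table's arithmetic (values copied from the rows above):

```python
# (solo, coop) success rates in percent, from the Detailed Results table.
results = {
    "Gemini 3 Flash (OpenHands SDK)": (48.6, 26.2),
    "GPT-5 (OpenHands)": (48.31, 27.95),
    "Claude Sonnet 4.5 (OpenHands)": (47.1, 25.9),
    "Gemini 3 Pro (Mini-SWE)": (36.8, 20.4),
    "MiniMax M2 (OpenHands)": (36.2, 14.0),
    "Gemini 3 Flash (Mini-SWE)": (25.2, 12.3),
    "Qwen3-Coder-30B (OpenHands)": (21.6, 13.3),
    "Qwen3-30B (OpenHands)": (6.3, 4.6),
}

for model, (solo, coop) in results.items():
    gap = round(coop - solo, 2)  # negative: coordination costs success rate
    print(f"{model}: gap = {gap:+} pp")
```

Every gap is negative: no model cooperates as well as it works alone, and the strongest solo models lose the most in absolute terms.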