The Curse of Coordination: Why Agents Cannot be Your Teammates Yet

652 Tasks
Real-world coding tasks from 12 popular open-source libraries across Python, TypeScript, Go, and Rust.
Coordination Gap
State-of-the-art agents achieve only 25% success with two-agent cooperation, ~50% lower than solo baselines.
3 Research Questions
How well can agents cooperate? What role does communication play? What coordination failures do agents exhibit?

Can AI agents work together as teammates? We find that coordinating agents perform much worse than a single agent given the same total workload. This coordination deficit presents a fundamental barrier to deploying AI systems that can work alongside humans or other agents.

Key Results

We evaluate state-of-the-art coding agents on CooperBench and observe the curse of coordination: agents perform significantly worse when cooperating than when working alone.

The Coordination Gap

GPT-5- and Claude Sonnet 4.5-based agents achieve only a ~25% success rate with two-agent cooperation on CooperBench, roughly 50% lower than a "Solo" baseline in which one agent implements both features.

Research Questions

RQ1

How well can agents cooperate with each other?

RQ2

What role does communication play in agent-agent cooperation?

RQ3

What coordination failures do agents exhibit?

Key Coordination Failures

Communication Breakdown

Agents send vague, ill-timed, and inaccurate messages. They fail to respond to direct questions and flood channels with repetitive status updates.

Commitment Deviation

Even with effective communication, agents deviate from their commitments. They make unverifiable claims about code state and ignore agreed-upon integration points.

Incorrect Expectations

Agents hold incorrect expectations about their partner's plans, duplicate work despite warnings, and overwrite changes they believe will merge cleanly.

Emergent Coordination Behaviors

Among runs that succeed, we observe coordination patterns largely absent from failed runs:

Role Division

Agents agree on who handles which part of the task and establish clear boundaries around their scope.

Resource Division

Agents avoid collisions by partitioning shared resources: specific files, code ranges, or ownership blocks.

Negotiation

Agents resolve conflicting approaches by proposing alternatives and converging on a single plan before acting.

The Coordination Gap

Our experiments reveal a consistent pattern: agents perform significantly worse when collaborating than when working alone. The gap persists across all models and increases with task difficulty.

Main Results: Solo vs Collaborative success rates across models and difficulty levels

Left: Success rates by model (Solo in blue, Collaborative in black). Right: Success rates vs relative task difficulty. The shaded area shows the coordination gap.

Emergent Coordination Patterns

We observe three emergent coordination behaviors in successful runs. These patterns are not prompted or scaffolded; they emerge when agents successfully navigate partial observability.

Role Division: Agents agree on who handles which part of the task and establish clear boundaries around their scope.


Coordination Failure Taxonomy

Through qualitative analysis of agent interactions, we identified observable symptoms (how failures manifest) and underlying capability gaps (why failures occur).

Capability Gaps (Causes)

Percentages are the share of analyzed failure cases attributed to each cause.

Expectation (62.9%): One agent has clearly communicated what it is doing, but the other agent still acts as if that work is not being done. This reflects a failure to model the state of the partner's code changes.

Communication (27.5%): Breakdowns in using language to coordinate, including failures in information sharing and decision loops, where agents do not effectively communicate their intentions, questions, or status updates.

Commitment (9.6%): An agent does not do what it promised to do, including failing to establish or maintain verifiable integration contracts.

Failure Symptoms

Percentages are the share of analyzed failure cases exhibiting each symptom.

Work overlap (33.2%): Both agents independently implement the same functionality, duplicating work and overwriting each other's changes.

Divergent architecture (29.7%): Incompatible design decisions lead to semantic loss even under a clean merge.

Repetition (14.7%): Verbose status messages add little new information and reduce signal.

Unresponsiveness (8.7%): Direct questions or requests go unanswered, breaking the decision loop.

Unverifiable claims (4.3%): An agent asserts a change or interface decision without evidence the partner can check.

Broken commitment (3.7%): Confident completion claims create false shared context when the promised change is absent.

Dependency access (1.7%): Uncommunicated risks leave agents unable to anticipate how dependencies interact after the merge.

Placeholder misuse (1.5%): An explicit integration contract exists but is applied differently than agreed.

Parameter flow (1.3%): Ambiguity about a changing interface leaves one agent implementing against an outdated contract.

Timing dependency (1.1%): Agents agree on an order of operations but fail to communicate an enforceable plan that preserves it after the merge.

Communication Patterns

Analysis of agent conversations reveals significant differences in communication styles across models. Some agents engage in lengthy negotiations while others barely communicate.

Communication error analysis across models

(a) Average conversation turns by model. (b) Repetition rates. (c) Unresponsiveness patterns. (d) Hallucination rates.

About CooperBench

CooperBench is the first benchmark designed to measure how well AI agents cooperate on individual tasks with potential conflicts. We focus on software engineering as a realistic domain where humans routinely have to coordinate their work within a team.

652
Tasks
12
Repositories
4
Languages
8
Annotators

Supported Languages

Python TypeScript Go Rust

Task Design

Each task assigns each agent a feature to implement, starting from the same repository state. Conflicts are intentionally embedded at the code level: the assigned features are logically compatible but require the agents to modify overlapping or interdependent code.

Eight co-authors with real-world software engineering backgrounds created new features, unit tests, and ground-truth code for these libraries, ensuring high-quality and realistic task design.
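For illustration only, the sketch below shows one plausible way to represent a single task; the class and field names are hypothetical and are not CooperBench's actual schema.

from dataclasses import dataclass

@dataclass
class FeatureSpec:
    # Hypothetical fields: a natural-language feature description and the
    # unit tests the corresponding agent's implementation must pass.
    description: str
    test_files: list[str]

@dataclass
class CooperTask:
    # Illustrative layout: both agents start from the same repository state,
    # and each receives its own feature, so their edits can touch overlapping code.
    repo: str                 # one of the 12 open-source libraries
    base_commit: str          # shared starting state for both agents
    feature_a: FeatureSpec    # assigned to Agent A
    feature_b: FeatureSpec    # assigned to Agent B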

Getting Started

Installation

Clone the CooperBench repository:

git clone https://github.com/akhatua2/CodeConflictBenchmark.git
cd CodeConflictBenchmark
pip install -r requirements.txt

Download Dataset

Access the dataset from HuggingFace:

from datasets import load_dataset

dataset = load_dataset("SALT-NLP/CooperBench")
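As a quick, schema-agnostic way to inspect what was downloaded, the sketch below prints the splits and fields instead of hard-coding them; consult the HuggingFace dataset card for the authoritative schema.

from datasets import load_dataset

# Load the benchmark and print its structure rather than assuming field names.
dataset = load_dataset("SALT-NLP/CooperBench")
print(dataset)                        # available splits and their sizes
split = next(iter(dataset.values()))
print(split.column_names)             # fields of each task record
print(split[0])                       # one full task record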

BibTeX

@article{khatua2025coordination,
  title={The Curse of Coordination: Why Agents Cannot be Your Teammates Yet},
  author={Khatua, Arpandeep and Zhu, Hao and Tran, Peter and Prabhudesai, Arya and Sadrieh, Frederic and Lieberwirth, Johann K. and Yu, Xinkai and Fu, Yicheng and Ryan, Michael J. and Pei, Jiaxin and Yang, Diyi},
  journal={arXiv preprint},
  year={2025},
  url={https://cotomata.com/}
}