CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. As AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. This contrasts sharply with human teams, where adding teammates typically improves productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans and communication. Through large-scale simulation, we also observe rare but interesting emergent coordination behavior including role division, resource division, and negotiation. Our research presents a novel benchmark for collaborative coding and calls for a shift from pursuing individual agent capability to developing social intelligence.
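As a rough illustration of the comparison behind the "curse of coordination", the sketch below computes the gap between a solo success rate (one agent implementing both features) and a cooperative success rate (two agents, one feature each). The `TaskResult` record, its field names, and the toy data are hypothetical and are not taken from CooperBench. The abstract does not say whether the 30% figure is an absolute or a relative drop, so the sketch reports both.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TaskResult:
    """Hypothetical per-task outcome; not the benchmark's actual schema."""
    task_id: str
    solo_pass: bool  # one agent implemented both features and all tests passed
    coop_pass: bool  # two agents took one feature each and the combined result passed


def success_rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def coordination_gap(results: List[TaskResult]) -> Tuple[float, float, float, float]:
    """Return (solo rate, coop rate, absolute drop, relative drop)."""
    solo = success_rate(r.solo_pass for r in results)
    coop = success_rate(r.coop_pass for r in results)
    absolute = solo - coop
    relative = absolute / solo if solo else 0.0
    return solo, coop, absolute, relative


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    TaskResult("task-a", solo_pass=True, coop_pass=True),
    TaskResult("task-b", solo_pass=True, coop_pass=False),
    TaskResult("task-c", solo_pass=False, coop_pass=False),
    TaskResult("task-d", solo_pass=True, coop_pass=True),
]

solo, coop, absolute, relative = coordination_gap(toy_results)
print(f"solo={solo:.2f} coop={coop:.2f} absolute_drop={absolute:.2f} relative_drop={relative:.2f}")
```

Reporting both drops avoids committing to one reading of the headline number: an absolute drop is measured in percentage points, while a relative drop is scaled by the solo baseline.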
