CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. As AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. This contrasts sharply with human teams, where adding teammates typically improves productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans and communication. Through large-scale simulation, we also observe rare but interesting emergent coordination behavior including role division, resource division, and negotiation. Our research presents a novel benchmark for collaborative coding and calls for a shift from pursuing individual agent capability to developing social intelligence.
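To make the headline comparison concrete, here is a minimal sketch of how a gap between solo and cooperative success rates could be computed. The abstract does not say whether the 30% figure is an absolute or a relative difference, so the sketch reports both. The record layout (`TaskResult`), the function names, and the sample outcomes are all hypothetical and are not taken from CooperBench.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TaskResult:
    """Outcome of one task pair (hypothetical layout, not the benchmark's schema)."""
    solo_passed: bool  # a single agent implemented both features and the tests passed
    coop_passed: bool  # two agents, one feature each, and the combined result passed


def success_rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def coordination_gap(results: List[TaskResult]) -> Tuple[float, float, float, float]:
    """Return (solo_rate, coop_rate, absolute_drop, relative_drop)."""
    solo = success_rate(r.solo_passed for r in results)
    coop = success_rate(r.coop_passed for r in results)
    absolute_drop = solo - coop
    # Relative drop is undefined when the solo rate is zero; report 0.0 in that case.
    relative_drop = absolute_drop / solo if solo > 0 else 0.0
    return solo, coop, absolute_drop, relative_drop


# Made-up outcomes, for illustration only.
sample = [
    TaskResult(solo_passed=True, coop_passed=True),
    TaskResult(solo_passed=True, coop_passed=False),
    TaskResult(solo_passed=True, coop_passed=True),
    TaskResult(solo_passed=False, coop_passed=False),
    TaskResult(solo_passed=True, coop_passed=False),
]

solo, coop, abs_drop, rel_drop = coordination_gap(sample)
print(f"solo={solo:.2f} coop={coop:.2f} absolute drop={abs_drop:.2f} relative drop={rel_drop:.2f}")
```

Reporting both differences avoids committing to one reading of the reported figure; whichever definition the benchmark uses, the comparison is between the same task pairs attempted by one agent and by two.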
