Multi-Agent Teams Hold Experts Back

Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 41.1% on ML benchmarks. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
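
The strong-synergy criterion and the notion of a performance loss relative to the expert can be made concrete with a minimal sketch. It assumes higher-is-better scalar scores. The function names (`strong_synergy`, `relative_loss`, `average_compromise`) and the example scores are hypothetical illustrations and do not come from the paper, and the exact metric behind the reported 41.1% figure is not specified here.

```python
def strong_synergy(team_score, member_scores):
    """Return True when the team matches or exceeds its best individual member."""
    return team_score >= max(member_scores)


def relative_loss(team_score, member_scores):
    """Fractional shortfall of the team relative to its best member (0.0 if none)."""
    best = max(member_scores)
    if best <= 0:
        raise ValueError("best member score must be positive")
    return max(0.0, (best - team_score) / best)


def average_compromise(member_scores):
    """Toy model of integrative compromise: the team lands on the plain mean."""
    return sum(member_scores) / len(member_scores)


# Hypothetical scores: one expert (0.9) and two weaker members.
members = [0.9, 0.5, 0.4]
team = average_compromise(members)

print(strong_synergy(team, members))  # the averaged team falls short of the expert
print(round(relative_loss(team, members), 3))
```

In this toy setting, a team that averages views rather than deferring to the expert lands well below the expert's score, which is the pattern the abstract describes as integrative compromise.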
