Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds that of the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
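The abstract's two central notions, strong synergy and integrative compromise, can be made concrete with a small sketch. It is illustrative only: the abstract does not say how the reported performance loss is computed, so the relative-gap formula, the function names, and all scores and weights here are assumptions, not the authors' method.

```python
from statistics import mean


def has_strong_synergy(team_score: float, member_scores: list[float]) -> bool:
    """Strong synergy as stated in the abstract: the team matches or
    exceeds its best individual member (higher score = better)."""
    return team_score >= max(member_scores)


def relative_gap(team_score: float, member_scores: list[float]) -> float:
    """Assumed loss measure: shortfall of the team relative to its best
    member, as a fraction of the best member's score."""
    best = max(member_scores)
    if best <= 0:
        raise ValueError("best member score must be positive")
    return (best - team_score) / best


def weighted_estimate(estimates: list[float], weights: list[float]) -> float:
    """Combine numeric estimates using non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * e for w, e in zip(weights, estimates)) / total


# Hypothetical scores: one expert (0.8) and two non-experts.
member_scores = [0.8, 0.5, 0.4]
team_score = 0.6
print(has_strong_synergy(team_score, member_scores))
print(round(relative_gap(team_score, member_scores), 3))

# Hypothetical estimation task with an assumed true value of 100.
# The first estimate comes from the expert.
true_value = 100.0
estimates = [98.0, 60.0, 70.0]

# Integrative compromise: every view counts equally.
compromise = mean(estimates)

# Expertise weighting: the expert's view dominates (weights are assumed).
expert_weighted = weighted_estimate(estimates, [0.8, 0.1, 0.1])

print(abs(compromise - true_value), abs(expert_weighted - true_value))
```

In this toy setting the equal-weight compromise lands further from the true value than the expert-weighted answer, which is the pattern the abstract describes as correlating negatively with performance.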