Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds that of the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than expert identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.