Multi-agent systems built on large language models (LLMs) are expected to enhance decision-making by pooling distributed information, yet systematically evaluating this capability has remained challenging. We introduce HiddenBench, a 65-task benchmark grounded in the Hidden Profile paradigm, which isolates collective reasoning under distributed information from individual reasoning ability. Evaluating 15 frontier LLMs, we find that multi-agent LLMs achieve only 30.1% accuracy under distributed information, compared to 80.7% for single agents given complete information. We trace this gap to a systematic failure mode: agents cannot recognize or act under latent information asymmetry. They fail to reason about what others might know but have not yet expressed, converging prematurely on shared evidence while critical distributed facts remain unexplored. These failures persist across prompting strategies, communication depths, and group sizes, and worsen as groups scale. While some models (e.g., Gemini-2.5-Flash/Pro) outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. Our results identify failure of collective information exploration as a key limitation of multi-agent LLMs in decision-making, and provide a theory-grounded, reproducible framework for diagnosing collective reasoning failures.