Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.