MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. These shortcomings stem from two key factors: (1) methodological complexity, in that agent orchestration is performed using sequential, code-level execution that limits global, system-level reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty, in that MAS are deployed without understanding whether they offer tangible benefits over single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains do not hold universally; they depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and the subagents. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks covering mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x higher efficiency than strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
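To make the idea of "subagents as callable functions" concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a whole plan is emitted up front as a list of function calls and only then executed. The names `solver`, `verifier`, `Call`, and `execute` are hypothetical, and the plan is written by hand here, whereas in the framework described above it would be produced by a trained orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical subagents: each hides its internal execution behind a plain
# function signature, so the orchestrator only reasons about names and wiring.
def solver(prompt: str) -> str:
    return f"draft answer for: {prompt}"

def verifier(prompt: str) -> str:
    return f"checked: {prompt}"

SUBAGENTS: Dict[str, Callable[[str], str]] = {
    "solver": solver,
    "verifier": verifier,
}

@dataclass(frozen=True)
class Call:
    """One function call in the plan (illustrative structure, an assumption)."""
    name: str                        # unique id of this call within the plan
    agent: str                       # key into SUBAGENTS
    depends_on: Tuple[str, ...] = () # ids of calls whose outputs feed this one

def execute(plan: List[Call], task: str) -> Dict[str, str]:
    """Run a complete plan in listed order, passing dependency outputs as context."""
    outputs: Dict[str, str] = {}
    for call in plan:
        missing = [d for d in call.depends_on if d not in outputs]
        if missing:
            raise ValueError(f"{call.name} depends on unknown calls: {missing}")
        context = " | ".join(outputs[d] for d in call.depends_on)
        prompt = task if not context else f"{task} [context: {context}]"
        outputs[call.name] = SUBAGENTS[call.agent](prompt)
    return outputs

# The entire system is specified at once: two independent solvers feed one verifier.
plan = [
    Call("s1", "solver"),
    Call("s2", "solver"),
    Call("v", "verifier", depends_on=("s1", "s2")),
]
results = execute(plan, "2+2")
```

Because the plan is a single data structure rather than step-by-step code, the whole system's shape (which calls are independent, which ones aggregate) is visible before anything runs.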
