MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

While multi-agent systems (MAS) promise elevated intelligence through the coordination of agents, current approaches to automatic MAS design under-deliver. These shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed using sequential, code-level execution that limits holistic, system-level reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether they offer tangible benefits over single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains are not universal; they depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and the sub-agents. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks covering mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
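
The idea of abstracting sub-agents as callable functions and emitting an entire MAS at once can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from MAS-Orchestra itself: the `Call` record, the `SUB_AGENTS` registry, the `solver` and `verifier` stubs, and the `run_plan` executor are hypothetical names, and the stubs return strings where a real sub-agent would run a model. The point is only that the orchestrator's output is a complete plan of function calls with dependencies, rather than code executed step by step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical sub-agents: each hides its internals behind a plain function.
def solver(task: str, inputs: List[str]) -> str:
    return f"solve({task})"


def verifier(task: str, inputs: List[str]) -> str:
    return f"verify({', '.join(inputs)})"


# Hypothetical registry mapping function names to sub-agents.
SUB_AGENTS: Dict[str, Callable[[str, List[str]], str]] = {
    "solver": solver,
    "verifier": verifier,
}


@dataclass
class Call:
    """One function call in the plan (illustrative schema, not the paper's)."""
    name: str                     # unique id of this node in the plan
    agent: str                    # key into SUB_AGENTS
    task: str                     # instruction passed to the sub-agent
    depends_on: List[str] = field(default_factory=list)


def run_plan(plan: List[Call]) -> Dict[str, str]:
    """Execute a complete plan; calls are assumed to be in dependency order."""
    results: Dict[str, str] = {}
    for call in plan:
        missing = [d for d in call.depends_on if d not in results]
        if missing:
            raise ValueError(f"{call.name} depends on unresolved calls: {missing}")
        inputs = [results[d] for d in call.depends_on]
        results[call.name] = SUB_AGENTS[call.agent](call.task, inputs)
    return results


# The whole system is specified up front: two solvers feeding one verifier.
plan = [
    Call("a", "solver", "2+2"),
    Call("b", "solver", "3*3"),
    Call("check", "verifier", "compare", depends_on=["a", "b"]),
]
results = run_plan(plan)
```

Because the plan is a single data structure, its overall shape (how many calls, which ones can run in parallel, where verification sits) can be inspected before anything executes, which is the kind of system-level view the abstract contrasts with sequential, code-level orchestration.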
