While multi-agent systems (MAS) promise elevated intelligence through the coordination of agents, current approaches to automatic MAS design under-deliver. These shortcomings stem from two key factors: (1) methodological complexity, in that agent orchestration is performed through sequential, code-level execution, which limits holistic system-level reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty, in that MAS are deployed without understanding whether they offer tangible benefits over single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, which enables global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains do not hold universally; they depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and the sub-agents. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks spanning mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.