MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. These shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed through sequential, code-level execution that limits holistic, system-level reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether they offer tangible benefits over single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains are not universal; they depend critically on task structure, verification protocols, and the capabilities of both the orchestrator and the subagents. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks covering mathematical reasoning, multi-hop QA, and search-based QA, while being more than 10x more efficient than strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
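
The idea of treating subagents as callable functions and emitting a whole system at once can be illustrated with a small sketch. This is not the MAS-Orchestra implementation; the names `Call`, `run_plan`, `solver`, and `verifier` are hypothetical, and the toy subagents stand in for what would be LLM-backed agents. The point is only that the orchestrator specifies the full call graph up front, and each subagent's internals stay hidden behind a function signature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical subagent signature: (task, upstream outputs) -> output text.
Subagent = Callable[[str, List[str]], str]


@dataclass
class Call:
    name: str                                         # node id within the plan
    agent: str                                        # key into the subagent registry
    inputs: List[str] = field(default_factory=list)   # names of upstream calls


def run_plan(task: str, plan: List[Call], registry: Dict[str, Subagent]) -> Dict[str, str]:
    """Execute a plan that was emitted in one shot, respecting dependencies."""
    results: Dict[str, str] = {}
    pending = list(plan)
    while pending:
        # A call is ready once all of its upstream calls have produced output.
        ready = [c for c in pending if all(i in results for i in c.inputs)]
        if not ready:
            raise ValueError("plan has a cycle or a missing dependency")
        for c in ready:
            upstream = [results[i] for i in c.inputs]
            results[c.name] = registry[c.agent](task, upstream)
        pending = [c for c in pending if c.name not in results]
    return results


# Toy stand-ins for goal-oriented subagents (assumptions for illustration).
def solver(task: str, upstream: List[str]) -> str:
    return f"answer({task})"


def verifier(task: str, upstream: List[str]) -> str:
    return "checked[" + "; ".join(upstream) + "]"


registry = {"solve": solver, "verify": verifier}

# The whole system is declared at once: two parallel solvers feeding one verifier.
plan = [
    Call("v", "verify", ["a", "b"]),
    Call("a", "solve"),
    Call("b", "solve"),
]
out = run_plan("2+2", plan, registry)
```

Because the plan is a plain data structure, the order in which calls are listed does not matter; the executor resolves dependencies, and the two independent solver calls could be run in parallel.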
