MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation

Shijie Wang, Pengfei Li, Yikun Fu, Kaifeng Liu, Fangyuan Li, Yang Liu, Xiaowei Sun, Zonglin Li, Siyao Zhao, Jian Zhao, Kai Tian, Dong Li, Junqi Gao, Yutong Zhang, Yiqun Chen, Yuqiang Li, Zoe Li, Weinan Zhang, Peng Ye, Shuyue Hu, Lei Bai, Bowen Zhou, Kaiyan Zhang, Biqing Qi

Although the complex reasoning capabilities of large language models (LLMs) have attracted significant attention, single-agent systems often hit inherent performance ceilings on complex tasks such as code generation. Multi-agent collaboration offers a promising way past these ceilings. However, existing frameworks typically rely either on prompt-based test-time interactions or on multi-role configurations trained with homogeneous parameters, which limits both error correction and strategic diversity. In this paper, we propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2), which integrates policy learning with multi-agent tree search by formulating the multi-agent collaborative exploration process as a dynamic and learnable environment. By allowing agents to iteratively explore and refine within this environment, the framework supports a progression from parameter-sharing homogeneous multi-role training to heterogeneous multi-agent training, breaking through single-agent capability limits. We also introduce an efficient inference strategy, MARTI-MARS2-T+, to fully exploit the scaling potential of multi-agent collaboration at test time. We conduct extensive experiments across three model scales (8B, 14B, and 32B) on challenging code generation benchmarks. Using two collaborating 32B models, MARTI-MARS2 achieves a score of 77.7%, outperforming strong baselines such as GPT-5.1. Furthermore, MARTI-MARS2 reveals a novel scaling law: shifting from single-agent to homogeneous multi-role and ultimately to heterogeneous multi-agent paradigms progressively yields higher RL performance ceilings, robust test-time scaling (TTS) capabilities, and greater policy diversity, suggesting that policy diversity is critical for scaling intelligence via multi-agent reinforcement learning.
