Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, for example through GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it raises accuracy from 14.0 to 47.0 percent for the single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math tasks. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
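To make the grouping idea concrete, the following is a minimal sketch of a group-relative advantage computed per (agent, turn) group, not the AT-GRPO implementation itself. It assumes that all samples come from rollouts of the same task instance, that each sample carries a scalar reward, and that advantages use the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function name `grouped_advantages` and the dictionary keys `agent`, `turn`, and `reward` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-6):
    """Return one advantage per sample, normalized within its (agent, turn) group.

    `samples` is a list of dicts with the hypothetical keys
    'agent', 'turn', and 'reward'. This is an illustrative assumption,
    not the data format used by the paper's training system.
    """
    # Bucket sample indices by the (agent, turn) pair they belong to.
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[(sample["agent"], sample["turn"])].append(idx)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 when all rewards match
        for i in indices:
            # GRPO-style normalization; eps avoids division by zero.
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages


# Example: two roles sampled twice each at the same turn.
samples = [
    {"agent": "coder", "turn": 0, "reward": 1.0},
    {"agent": "coder", "turn": 0, "reward": 0.0},
    {"agent": "tester", "turn": 0, "reward": 0.5},
    {"agent": "tester", "turn": 0, "reward": 0.5},
]
advantages = grouped_advantages(samples)
```

Because each group contains only samples that share a role and a turn, rewards are compared among candidates that answered comparable prompts, which is the property the standard single-group assumption loses in a MAS workflow.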