Leveraging multiple large language models (LLMs) to build collaborative multi-agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting out-of-the-box LLMs and rely on their innate capacity for collaboration, which, as recent work has shown, may not improve LLMs' performance. In this paper, we introduce a new post-training paradigm, MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), to explicitly elicit collaborative behaviors and further unleash the power of multi-agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and then engage in a multi-turn discussion to collaboratively improve the final answer. A MAPoRL verifier then evaluates both the answer and the discussion by assigning a score that verifies the correctness of the answer, with added incentives that encourage corrective and persuasive discussion. This score serves as the co-training reward and is maximized through multi-agent RL. Unlike existing LLM post-training paradigms, MAPoRL advocates co-training multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration; in contrast, multi-agent co-training boosts collaboration performance across benchmarks and generalizes to unseen domains.
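To make the described pipeline concrete, below is a minimal Python sketch of one MAPoRL-style rollout: independent first answers, a multi-turn discussion, and a verifier score that rewards a correct final answer plus a small bonus for corrective turns. All names (`Agent`, `rollout`, `verifier_reward`) and the specific reward shaping (the +0.1 corrective bonus) are illustrative assumptions, not the paper's actual implementation; the policies here are toy stubs standing in for LLMs.

```python
# Hypothetical sketch of a MAPoRL-style rollout and verifier reward.
# Interfaces and reward shaping are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    agent_id: int
    text: str

@dataclass
class Agent:
    agent_id: int
    # Stub policy: maps (question, transcript so far) -> response text.
    policy: Callable[[str, list[Turn]], str]

    def respond(self, question: str, transcript: list[Turn]) -> str:
        return self.policy(question, transcript)

def rollout(question: str, agents: list[Agent], num_turns: int) -> list[Turn]:
    """Stage 1: independent answers; stage 2: multi-turn discussion."""
    transcript: list[Turn] = []
    # Turn 0: each agent answers independently, seeing no peer context.
    for a in agents:
        transcript.append(Turn(a.agent_id, a.respond(question, [])))
    # Turns 1..T: each agent sees the full discussion so far and revises.
    for _ in range(num_turns):
        for a in agents:
            transcript.append(Turn(a.agent_id, a.respond(question, transcript)))
    return transcript

def verifier_reward(transcript: list[Turn],
                    is_correct: Callable[[str], bool]) -> float:
    """Hypothetical verifier: scores correctness of the final answer and adds
    a small bonus when a turn flips a wrong answer to a right one (a stand-in
    for the paper's corrective/persuasive incentives)."""
    reward = 1.0 if is_correct(transcript[-1].text) else 0.0
    for prev, curr in zip(transcript, transcript[1:]):
        if not is_correct(prev.text) and is_correct(curr.text):
            reward += 0.1  # incentive for a corrective turn
    return reward

if __name__ == "__main__":
    # Toy example: two stub "LLMs" discussing 17 * 23; one starts wrong.
    def weak_policy(q: str, history: list[Turn]) -> str:
        return history[-1].text if history else "381"  # copies the last speaker
    def strong_policy(q: str, history: list[Turn]) -> str:
        return "391"
    agents = [Agent(0, weak_policy), Agent(1, strong_policy)]
    transcript = rollout("What is 17 * 23?", agents, num_turns=1)
    r = verifier_reward(transcript, lambda s: s == "391")
    print(f"reward = {r}")  # this scalar would drive multi-agent RL (e.g., PPO)
```

In actual co-training, this scalar reward would be fed to a multi-agent RL optimizer (e.g., per-agent PPO updates over the discussion turns each agent produced), which is the step the sketch leaves abstract.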