MARFT: Multi-Agent Reinforcement Fine-Tuning

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.
