Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, since agents can run inference in parallel under flexible deployments. Moreover, current approaches fine-tune with Monte Carlo methods, which suffer from high variance and therefore require more samples to train effectively. Actor-critic methods are the prevailing remedy for these issues in MARL, so we develop Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: \textbf{CoLLM-CC} with a \textbf{C}entralized \textbf{C}ritic and \textbf{CoLLM-DC} with \textbf{D}ecentralized \textbf{C}ritics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can match CoLLM-CC in short-horizon, dense-reward settings. However, both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.6.
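The abstract attributes Monte Carlo fine-tuning's sample inefficiency to high variance, which critics mitigate by subtracting a state-dependent baseline. The sketch below (a toy illustration, not the paper's method; the linear value function and noise model are assumptions) shows how subtracting a value-function baseline shrinks the variance of the learning signal from raw Monte Carlo returns:

```python
import random

random.seed(0)

N = 5000
# Toy setup (assumed, not from the paper): state s ~ Uniform(0, 1),
# true value function V(s) = 10*s, and Gaussian reward noise as the
# irreducible randomness of a Monte Carlo return.
samples = []
for _ in range(N):
    s = random.random()
    g = 10.0 * s + random.gauss(0.0, 1.0)  # Monte Carlo return estimate
    samples.append((s, g))

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Raw Monte Carlo returns: variance includes the spread across states.
mc = [g for _, g in samples]

# Critic-style advantages: subtracting the baseline V(s) removes the
# state-driven variance, leaving only the reward noise (~1.0).
adv = [g - 10.0 * s for s, g in samples]
```

Here `var(mc)` is roughly `100/12 + 1` (state variance plus noise) while `var(adv)` is close to `1.0`, so a policy-gradient update driven by advantages needs far fewer samples to average out, which is the intuition behind preferring actor-critic over Monte Carlo in long-horizon, sparse-reward settings.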