Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel under flexible deployments. Moreover, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and therefore require more samples to train effectively. Actor-critic methods are widely used in MARL to address these issues, so we develop Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: \textbf{CoLLM-CC}, with a \textbf{C}entralized \textbf{C}ritic, and \textbf{CoLLM-DC}, with \textbf{D}ecentralized \textbf{C}ritics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC achieve performance comparable to CoLLM-CC in short-horizon, dense-reward settings. However, both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.2.