Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, since agents can run inference in parallel and be deployed flexibly. In addition, current approaches fine-tune with Monte Carlo methods, which suffer from high variance and therefore require more samples to train effectively. Actor-critic methods are widely used in MARL to address these issues, so we develop Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: \textbf{CoLLM-CC} with a \textbf{C}entralized \textbf{C}ritic and \textbf{CoLLM-DC} with \textbf{D}ecentralized \textbf{C}ritics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.
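To make the contrast between Monte Carlo and actor-critic learning signals concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the estimators used, so the function names, the discount factor `gamma`, and the one-step TD form of the advantage are all assumptions chosen for illustration. A Monte Carlo method weights each action by the full sampled return, while an actor-critic method replaces that return with a signal bootstrapped from a learned value estimate.

```python
from typing import List


def monte_carlo_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go G_t computed from one complete sampled episode.

    Each G_t sums every later reward in the episode, so noise from all
    later steps accumulates in it (the high-variance signal).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_advantages(rewards: List[float], values: List[float], gamma: float = 1.0) -> List[float]:
    """One-step TD advantage A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` is a hypothetical critic output with len(rewards) + 1 entries;
    the last entry is the value after the final step (0.0 if terminal).
    Whether the critic sees joint or per-agent information is left open here.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Sparse-reward episode: only the final step is rewarded.
rewards = [0.0, 0.0, 1.0]
mc = monte_carlo_returns(rewards, gamma=0.5)

# Assumed critic that already predicts the returns exactly.
values = mc + [0.0]
adv = td_advantages(rewards, values, gamma=0.5)
```

In this toy episode the critic is assumed to be exact, so every advantage is zero: the learning signal reflects only how much an outcome deviates from what the critic expected, whereas the Monte Carlo signal depends on the whole remainder of the sampled trajectory.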