Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and the social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize two coordination challenges, strategy agreement (SA) and strategy fine-tuning (SFT), on which current offline MARL algorithms fail. Concretely, we show that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this shortcoming, we emphasize the importance of inter-agent interactions and propose the first model-based offline MARL method. The resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.
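
To make the strategy agreement challenge concrete, the following minimal sketch may help. It is not MOMA-PPO and not the formalization used in the paper: the one-step matching game, the four-sample dataset, the tabular reward model, and the alternating best-response loop are all assumptions chosen for illustration. It shows two agents that learn independently from the same offline data and settle on incompatible actions, and how letting them interact inside a model fitted to that data can bring them to a common strategy.

```python
from collections import defaultdict

# Hypothetical offline dataset of (action_agent0, action_agent1, reward) tuples
# for a one-step game that pays 1 only when both agents pick the same action.
DATASET = [(0, 0, 1.0), (1, 1, 1.0), (0, 1, 0.0), (1, 0, 0.0)]


def independent_values(dataset, agent):
    """Average reward per own action, ignoring what the other agent did."""
    totals, counts = defaultdict(float), defaultdict(int)
    for a0, a1, r in dataset:
        own = (a0, a1)[agent]
        totals[own] += r
        counts[own] += 1
    return {a: totals[a] / counts[a] for a in totals}


def fit_reward_model(dataset):
    """Tabular stand-in for a world model: mean reward per joint action."""
    totals, counts = defaultdict(float), defaultdict(int)
    for a0, a1, r in dataset:
        totals[(a0, a1)] += r
        counts[(a0, a1)] += 1
    return {joint: totals[joint] / counts[joint] for joint in totals}


def coordinate_in_model(model, joint, actions=(0, 1), sweeps=5):
    """Agents take turns best-responding to each other inside the model."""
    a0, a1 = joint
    for _ in range(sweeps):
        a0 = max(actions, key=lambda a: model[(a, a1)])
        a1 = max(actions, key=lambda a: model[(a0, a)])
    return a0, a1


values0 = independent_values(DATASET, agent=0)
values1 = independent_values(DATASET, agent=1)

# Both actions look equally good to each agent, so ties are broken arbitrarily.
# Assumption: agent 0 takes the lowest-index action, agent 1 the highest.
mismatched = (
    min(values0, key=lambda a: (-values0[a], a)),
    max(values1, key=lambda a: (values1[a], a)),
)

model = fit_reward_model(DATASET)
agreed = coordinate_in_model(model, mismatched)

print("independent learners:", mismatched, "reward", model[mismatched])
print("after synthetic interaction:", agreed, "reward", model[agreed])
```

In this toy setting the per-agent value estimates are tied, so nothing in the data tells an isolated learner which of the two good joint strategies its partner will choose. The synthetic interactions break that tie because each agent responds to what the other currently does rather than to the dataset average.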