Training multiple agents to coordinate is an important problem with applications in robotics, game theory, economics, and the social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and strategy fine-tuning (SFT) challenges, two coordination issues on which current offline MARL algorithms fail. To address these failures, we propose a simple model-based approach that generates synthetic interaction data, enabling agents to converge on a strategy while fine-tuning their policies accordingly. The resulting method, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), outperforms prevalent learning methods on challenging offline multi-agent MuJoCo tasks, even under severe partial observability and with learned world models.