MABL: Bi-Level Latent-Variable World Model for Sample-Efficient Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) methods often suffer from high sample complexity, limiting their use in real-world problems where data is sparse or expensive to collect. Although latent-variable world models have been employed to address this issue by generating abundant synthetic data for MARL training, most of these models cannot encode vital global information available during training into their latent states, which hampers learning efficiency. The few exceptions that incorporate global information assume centralized execution of their learned policies, which is impractical in many applications with partial observability. We propose a novel model-based MARL algorithm, MABL (Multi-Agent Bi-Level world model), that learns a bi-level latent-variable world model from high-dimensional inputs. Unlike existing models, MABL encodes essential global information into the latent states during training while guaranteeing decentralized execution of the learned policies. For each agent, MABL learns a global latent state at the upper level, which is used to inform the learning of an agent latent state at the lower level. During execution, agents use only their lower-level latent states and act independently. Crucially, MABL can be combined with any model-free MARL algorithm for policy learning. In our empirical evaluation on complex discrete and continuous multi-agent tasks, including SMAC, Flatland, and MAMuJoCo, MABL surpasses state-of-the-art multi-agent latent-variable world models in both sample efficiency and overall performance.
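The bi-level idea described above can be illustrated with a toy sketch. This is not the MABL architecture: the class `BiLevelSketch`, its methods, the fixed random `tanh` layers standing in for learned networks, and the weight sharing across agents are all assumptions made for illustration. It only shows the information flow, in which an upper-level global latent informs each agent's lower-level latent during training, while execution depends on the agent's own observation alone.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def make_weights(rng: random.Random, n_out: int, n_in: int) -> List[Vector]:
    """Random weight matrix; a stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights: List[Vector], x: Vector) -> Vector:
    """Compute tanh(W x); a stand-in for a learned network."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


class BiLevelSketch:
    """Hypothetical two-level latent encoder (illustrative only)."""

    def __init__(self, n_agents: int, obs_dim: int, latent_dim: int, seed: int = 0):
        rng = random.Random(seed)
        # Upper level: sees every agent's observation (training only).
        self.w_global = make_weights(rng, latent_dim, n_agents * obs_dim)
        # Lower level at training time: own observation plus global latent.
        self.w_agent_train = make_weights(rng, latent_dim, obs_dim + latent_dim)
        # Lower level at execution time: own observation only.
        self.w_agent_exec = make_weights(rng, latent_dim, obs_dim)

    def train_latents(self, observations: List[Vector]) -> Tuple[Vector, List[Vector]]:
        """Return the global latent and one global-informed latent per agent."""
        joint = [v for obs in observations for v in obs]
        global_latent = dense(self.w_global, joint)
        agent_latents = [
            dense(self.w_agent_train, obs + global_latent) for obs in observations
        ]
        return global_latent, agent_latents

    def exec_latent(self, own_obs: Vector) -> Vector:
        """Decentralized path: no access to other agents or the global latent."""
        return dense(self.w_agent_exec, own_obs)


if __name__ == "__main__":
    model = BiLevelSketch(n_agents=3, obs_dim=4, latent_dim=2)
    obs = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
    global_latent, agent_latents = model.train_latents(obs)
    print("global latent:", global_latent)
    print("agent 0 (training):", agent_latents[0])
    print("agent 0 (execution):", model.exec_latent(obs[0]))
```

In this sketch, changing another agent's observation alters agent 0's training-time latent through the global latent, but leaves its execution-time latent untouched, which is the decentralization property the abstract emphasizes. How MABL actually ties the two levels together during learning is not specified here.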
