Offline multi-agent reinforcement learning (MARL) leverages static datasets of experience to learn optimal multi-agent control. However, learning from static data presents several unique challenges. In this paper, we focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data, in a common setting we refer to as the 'Best Response Under Data' (BRUD) approach. Using two-player polynomial games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms, which can lead to catastrophic coordination failure in the offline setting. Building on these insights, we propose an approach to mitigate such failure by prioritising samples from the dataset based on joint-action similarity during policy learning, and demonstrate its effectiveness in detailed experiments. More generally, we argue that prioritised dataset sampling is a promising area for innovation in offline MARL that can be combined with other effective approaches such as critic and policy regularisation. Importantly, our work shows how insights drawn from simplified, tractable games can yield useful, theoretically grounded lessons that transfer to more complex settings. A core dimension of our contribution is an interactive notebook, from which almost all of our results can be reproduced in a browser.
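To make the failure mode and the mitigation concrete, below is a minimal NumPy sketch of a two-player polynomial game with shared reward R(a1, a2) = a1 * a2. The dataset composition, the exponential similarity kernel, and its temperature are illustrative assumptions for this sketch, not the paper's exact prioritisation scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-player polynomial game: shared reward R(a1, a2) = a1 * a2,
# actions in [-1, 1]. Optimal coordinated joint actions: (+1, +1) or (-1, -1).

# Hypothetical offline dataset: half the joint actions are well coordinated
# (near (0.8, 0.8)), half come from a miscoordinated behaviour (near (1, -1)).
good = rng.normal(loc=[0.8, 0.8], scale=0.05, size=(128, 2))
bad = rng.normal(loc=[1.0, -1.0], scale=0.05, size=(128, 2))
data = np.concatenate([good, bad])

def train(prioritised, steps=300, lr=0.05, temp=5.0):
    """Deterministic policies parameterised directly by joint action theta."""
    theta = np.zeros(2)
    for _ in range(steps):
        batch = data[rng.choice(len(data), 32)]
        if prioritised:
            # Weight samples by similarity between the dataset joint action
            # and the current joint policy (exponential kernel; kernel form
            # and temperature are assumptions made for this illustration).
            w = np.exp(-temp * np.linalg.norm(batch - theta, axis=1))
            w /= w.sum()
        else:
            w = np.full(len(batch), 1.0 / len(batch))
        # BRUD-style update: each agent ascends dR w.r.t. its own action
        # while the OTHER agent's action comes from the dataset, not the
        # current policy.
        grad1 = np.sum(w * batch[:, 1])  # dR/da1 = a2 (taken from data)
        grad2 = np.sum(w * batch[:, 0])  # dR/da2 = a1 (taken from data)
        theta = np.clip(theta + lr * np.array([grad1, grad2]), -1.0, 1.0)
    return theta

for mode in (False, True):
    t = train(prioritised=mode)
    print(f"prioritised={mode}: joint action {t.round(2)}, reward {t[0] * t[1]:+.2f}")

# Uniform sampling drives the agents to roughly (-1, +1), reward ≈ -1:
# each agent best-responds to the other's *dataset* actions, so they
# miscoordinate catastrophically even though half the data is near-optimal.
# Similarity-prioritised sampling instead converges near (+1, +1), reward ≈ +1.
```

Running the sketch shows both halves of the argument: the uniform (BRUD) update lands at a reward of about -1, while the joint-action-similarity weighting keeps the policy gradient anchored to coordinated data and recovers a near-optimal joint action.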