Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without further environmental interaction. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL by the interdependence between agents' local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi-agent value decomposition strategy to facilitate multi-agent training. Through extensive experiments on the multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks.
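To make the notion of stationary-distribution regularization concrete, a generic DICE-style objective (a sketch of the general family, not necessarily ComaDICE's exact formulation) can be written as:

```latex
% Generic stationary-distribution-regularized objective (DICE family).
% d^{\pi} is the stationary (occupancy) distribution of the learning policy \pi,
% d^{D} is the occupancy distribution of the behavior policy underlying dataset D,
% D_f is an f-divergence, and \alpha > 0 trades off return against deviation.
\max_{\pi} \; \mathbb{E}_{(s,a) \sim d^{\pi}}\left[ r(s,a) \right]
  \;-\; \alpha \, D_f\!\left( d^{\pi} \,\|\, d^{D} \right)
```

Regularizing occupancy distributions rather than Q-functions or policies directly penalizes visiting state-action pairs that are unsupported by the dataset, which is the root cause of distributional shift.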