Offline reinforcement learning (RL) has received considerable attention in recent years due to its attractive capability of learning policies from offline datasets without environmental interactions. Despite some success in the single-agent setting, offline multi-agent RL (MARL) remains a challenge. The large joint state-action space and the coupled multi-agent behaviors pose extra complexities for offline policy optimization. Most existing offline MARL studies simply apply offline data-related regularizations to individual agents, without fully considering the multi-agent system at the global level. In this work, we present OMIGA, a new offline multi-agent RL algorithm with implicit global-to-local value regularization. OMIGA provides a principled framework to convert global-level value regularization into equivalent implicit local value regularizations and simultaneously enables in-sample learning, thus elegantly bridging multi-agent value decomposition and policy learning with offline regularizations. Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks.