We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Unlike existing algorithms, which rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity by using a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme for training. These key ingredients make the algorithm more robust to environment changes and yield significant improvements in performance, generalization, and data efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments and, thanks to its high expressiveness and diversity, generalizes significantly better to shifted environments (in $28$ out of $30$ settings evaluated). Moreover, DOM2 is highly data-efficient: it requires no more than $5\%$ of the data that existing algorithms need to achieve the same performance (a $20\times$ improvement in data efficiency).