We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Unlike existing algorithms, which rely mainly on conservatism in policy design, DOM2 uses diffusion to enhance policy expressiveness and diversity. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-augmentation scheme for training. These key ingredients make our algorithm more robust to environment changes and yield significant improvements in performance, generalization, and data efficiency. Extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in multi-agent particle and multi-agent MuJoCo environments, and that it generalizes significantly better in shifted environments thanks to its high expressiveness and diversity. Furthermore, DOM2 shows superior data efficiency: it can achieve state-of-the-art performance with $20+$ times less data than existing algorithms.
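The abstract does not specify how DOM2's diffusion policy is built. As a rough illustration of what "incorporating a diffusion model into the policy network" can mean, the sketch below uses standard DDPM reverse sampling to turn Gaussian noise into an action, conditioned on one agent's observation, with each agent sampling from its own observation. This is a generic technique, not DOM2's exact method. The linear beta schedule, the number of steps, the $[-1, 1]$ action range, and `dummy_noise_predictor` (a hypothetical stand-in for a trained noise-prediction network) are all assumptions.

```python
import numpy as np


def make_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_N (a common DDPM choice; assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def dummy_noise_predictor(noisy_action, obs, t):
    """Hypothetical stand-in for a learned network eps_theta(a_t, obs, t).
    A real diffusion policy would call a trained neural network here."""
    return np.zeros_like(noisy_action)


def sample_action(obs, action_dim, noise_predictor, num_steps=10, rng=None):
    """Generic DDPM reverse process: denoise a_N ~ N(0, I) down to a_0."""
    rng = np.random.default_rng() if rng is None else rng
    betas = make_beta_schedule(num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = noise_predictor(a, obs, t)  # predicted noise, conditioned on obs
        # Posterior mean: (a_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise with variance sigma_t^2 = beta_t (one common choice)
            a = mean + np.sqrt(betas[t]) * rng.standard_normal(action_dim)
        else:
            a = mean  # no noise on the final step
    return np.clip(a, -1.0, 1.0)  # assume actions live in [-1, 1]


# Usage: two agents, each sampling from its own (toy) observation.
rng = np.random.default_rng(0)
observations = [np.zeros(4), np.ones(4)]
actions = [
    sample_action(o, action_dim=2, noise_predictor=dummy_noise_predictor, rng=rng)
    for o in observations
]
```

Because every call starts from fresh Gaussian noise, repeated calls can return different actions for the same observation; with a trained noise predictor, this stochastic sampling is one way a diffusion policy can represent diverse, multimodal behavior rather than a single deterministic action.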