Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parameterized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations: 1) it is computationally inefficient to run the forward and backward passes through the whole Markov chain during training; 2) it is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods), as the likelihood of diffusion models is intractable. We therefore propose efficient diffusion policy (EDP) to overcome these two challenges. During training, EDP approximately reconstructs actions from corrupted ones, which avoids running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves a new state of the art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp.
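The abstract does not spell out how an action is reconstructed from a corrupted one, so the following is only a minimal sketch under stated assumptions: a standard DDPM-style forward process with a linear noise schedule, in which a clean action can be estimated in a single step from its noisy version and a noise prediction. The function names (`make_alpha_bars`, `corrupt_action`, `approximate_action`), the schedule values, and the use of plain Python lists are all hypothetical and are not taken from the EDP code base.

```python
import math
import random


def make_alpha_bars(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_k) for an assumed linear noise schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def corrupt_action(action, k, alpha_bars, rng):
    """Closed-form forward diffusion: a_k = sqrt(ab_k) * a_0 + sqrt(1 - ab_k) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    scale = math.sqrt(alpha_bars[k])
    sigma = math.sqrt(1.0 - alpha_bars[k])
    noisy = [scale * a + sigma * e for a, e in zip(action, noise)]
    return noisy, noise


def approximate_action(noisy_action, predicted_noise, k, alpha_bars):
    """One-step estimate of the clean action, obtained by inverting the forward formula."""
    scale = math.sqrt(alpha_bars[k])
    sigma = math.sqrt(1.0 - alpha_bars[k])
    return [(a - sigma * e) / scale for a, e in zip(noisy_action, predicted_noise)]


# Usage: the true noise stands in for a trained noise-prediction network (an assumption).
rng = random.Random(0)
alpha_bars = make_alpha_bars()
action = [0.3, -0.7, 0.1]
noisy, noise = corrupt_action(action, 50, alpha_bars, rng)
recovered = approximate_action(noisy, noise, 50, alpha_bars)
```

In this sketch the reconstruction costs one network evaluation per training sample instead of a pass through the full reverse chain, which is the kind of saving the abstract describes. With a learned noise predictor the estimate would only be approximate, and its accuracy would depend on the diffusion step `k`.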