Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parameterized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations: 1) it is computationally inefficient to run forward and backward passes through the whole Markov chain during training; 2) it is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods), because the likelihood of diffusion models is intractable. We therefore propose the efficient diffusion policy (EDP) to overcome these two challenges. During training, EDP approximately reconstructs actions from corrupted ones, which avoids running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP reduces diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves a new state of the art on D4RL, outperforming previous methods by large margins. Our code is available at https://github.com/sail-sg/edp.
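The idea of reconstructing an action from a corrupted one, rather than running the full sampling chain, can be illustrated with the standard one-step denoising identity used in DDPM-style diffusion models. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the function names `make_alpha_bars`, `corrupt_action`, and `approx_action`, and the "oracle" noise predictor standing in for a trained network are all hypothetical choices made for illustration.

```python
import numpy as np

def make_alpha_bars(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_k) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def corrupt_action(action, noise, alpha_bar_k):
    """Closed-form forward diffusion: a_k = sqrt(ab_k) * a_0 + sqrt(1 - ab_k) * eps."""
    return np.sqrt(alpha_bar_k) * action + np.sqrt(1.0 - alpha_bar_k) * noise

def approx_action(noisy_action, predicted_noise, alpha_bar_k):
    """One-step estimate of a_0 from a_k; no iterative sampling chain is run."""
    scaled_noise = np.sqrt(1.0 - alpha_bar_k) * predicted_noise
    return (noisy_action - scaled_noise) / np.sqrt(alpha_bar_k)

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

# A batch of 4 dataset actions with 6 dimensions each (illustrative sizes).
action = rng.uniform(-1.0, 1.0, size=(4, 6))
noise = rng.standard_normal(action.shape)

# Corrupt the actions at an arbitrary diffusion step k.
k = 50
noisy = corrupt_action(action, noise, alpha_bars[k])

# Hypothetical oracle predictor: it returns the true noise, so the
# reconstruction is exact. A trained network would only approximate it.
recovered = approx_action(noisy, noise, alpha_bars[k])
```

With a learned noise predictor in place of the oracle, the reconstruction is only approximate, but it costs a single network evaluation instead of hundreds of sampling steps. In a setup like this, the reconstructed action could then be passed to a critic during policy improvement without backpropagating through the whole chain.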