Distribution shift is a major obstacle in offline reinforcement learning: the learned policy must stay close to the behavior policy to avoid overestimating the value of rare or unseen actions. Previous conservative offline RL algorithms learn good in-distribution policies but struggle to generalize to unseen actions. In contrast, we propose to use the gradient field of the dataset density to adjust the actions generated by a pre-trained offline RL policy. Because the conservatism constraint is decoupled from the policy, our approach can benefit a wide range of offline RL algorithms. Concretely, we propose the Conservative Denoising Score-based Algorithm (CDSA), which uses a denoising score-based model to learn the gradient of the dataset density rather than the density itself, yielding a more accurate and efficient way to adjust the actions produced by the pre-trained policy in deterministic, continuous MDP environments. Experiments show that our approach significantly improves the performance of baseline algorithms on the D4RL datasets and demonstrates the generalizability and plug-and-play capability of our model across different pre-trained offline RL policies and tasks. We also validate that the agent becomes more risk-averse after applying our method while still generalizing effectively across diverse tasks.
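To make the mechanism concrete, the following is a minimal PyTorch sketch of the action-adjustment idea described above: a score network approximates the gradient of the dataset density with respect to the action, and the pre-trained policy's action is nudged along that gradient for a few steps. The names (`ScoreNet`, `adjust_action`) and hyperparameters (`step_size`, `n_steps`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Approximates the gradient of the dataset density w.r.t. the action,
    i.e. s_theta(state, action) ~ grad_a log p_D(action | state)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

@torch.no_grad()
def adjust_action(score_net, state, action, step_size=0.05, n_steps=10):
    """Nudge the pre-trained policy's action toward higher dataset density
    by ascending the learned score field (illustrative hyperparameters)."""
    a = action.clone()
    for _ in range(n_steps):
        a = a + step_size * score_net(state, a)
    return a.clamp(-1.0, 1.0)  # assumes actions are normalized to [-1, 1]
```

At deployment, the adjusted action would be obtained as `adjust_action(score_net, state, policy(state))`, leaving the pre-trained policy itself untouched.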