We propose DIAR, a novel offline reinforcement learning (offline RL) framework: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation. DIAR addresses two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate a value function for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-horizon decision-making. Furthermore, we mitigate Q-value overestimation by combining Q-network learning with a value function guided by the diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. On long-horizon, sparse-reward benchmarks such as Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms.
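As a rough illustration of the Adaptive Revaluation mechanism described above, the sketch below shows the value comparison in plain Python. All names (`value_fn`, `replan_fn`, `planned_states`) are hypothetical placeholders rather than the paper's API; the actual DIAR procedure operates on latent trajectories sampled from the diffusion model and on its learned value function.

```python
from typing import Callable, List, Sequence

State = Sequence[float]  # placeholder state representation

def adaptive_revaluation(
    value_fn: Callable[[State], float],
    current_state: State,
    planned_states: List[State],
    replan_fn: Callable[[State], List[State]],
) -> List[State]:
    """Sketch of the Adaptive Revaluation idea: commit to the sampled
    plan only while each predicted future state is valued at least as
    highly as the current state; otherwise truncate and replan."""
    v_now = value_fn(current_state)  # V(s_t) from the learned value function
    for future_state in planned_states:
        if value_fn(future_state) < v_now:
            # The projected value drops below the current state's value:
            # cut the decision length here and sample a fresh plan
            # (e.g., from the diffusion model).
            return replan_fn(current_state)
    return planned_states
```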