``Distribution shift'' is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values of these OOD actions are easily overestimated, and as a result the learning policy is biased by these incorrect Q-value estimates. A common remedy for Q-value overestimation is a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. QDQ provides solid theoretical guarantees for the accuracy of Q-value distribution learning and uncertainty measurement, as well as for the performance of the learning policy. QDQ consistently performs strongly on the D4RL benchmark and achieves significant improvements across many tasks.
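To make the key idea concrete, here is a minimal sketch of an uncertainty-guided pessimistic adjustment: the spread of samples from a learned Q-value distribution serves as the uncertainty signal, and targets for high-uncertainty (likely OOD) actions are penalized. All names and the thresholding rule are illustrative assumptions, not QDQ's actual objective.

```python
import numpy as np

def pessimistic_q_target(q_samples, base_target, threshold=1.0, penalty=0.5):
    """Hypothetical uncertainty-guided pessimistic target.

    q_samples:   samples drawn from a learned Q-value distribution for one
                 (state, action) pair, e.g. from a consistency model.
    base_target: the standard Bellman target for this pair.
    """
    uncertainty = np.std(q_samples)   # spread of the Q-distribution as an
                                      # uncertainty proxy
    if uncertainty > threshold:       # high uncertainty: treat as likely OOD
        return base_target - penalty * uncertainty
    return base_target                # in-distribution: leave target unchanged
```

An in-distribution action (tightly clustered Q-samples) keeps its target, while an OOD action (widely spread samples) is pushed down in proportion to its uncertainty, which is the pessimism pattern the abstract describes.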