Learning high-quality Q-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous works focus on the value overestimation issue, which arises from the use of function approximators and off-policy learning. Departing from this common viewpoint, we observe that Q-values are in fact underestimated in the later stages of RL training. This underestimation arises primarily because Bellman updates use actions from the current policy, which can be inferior to the better action samples already stored in the replay buffer. We hypothesize that this long-neglected phenomenon hinders policy learning and reduces sample efficiency. Our key insight is to exploit past successes sufficiently while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates Q-values using both the historical best-performing actions and the current policy. Instantiations of our method in both model-free and model-based settings outperform state-of-the-art methods on various continuous control tasks and achieve strong performance in failure-prone scenarios and real-world robot tasks.
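The idea of updating Q-values from both historical best-performing actions and the current policy can be illustrated with a minimal sketch. It is not the operator as defined in the paper: the function name `blended_target`, the mixing weight `lam`, the use of a plain maximum over buffer actions, and the sample mean over policy actions are all assumptions made here for illustration.

```python
import numpy as np

def blended_target(reward, gamma, q_next_buffer, q_next_policy, lam=0.5):
    """Blend an exploitation-style and an exploration-style Bellman target.

    q_next_buffer: Q(s', a') for next actions found in the replay buffer
    q_next_policy: Q(s', a') for next actions sampled from the current policy
    lam: hypothetical mixing weight in [0, 1] (an assumption of this sketch)
    """
    # Exploitation: bootstrap from the best action seen so far in the buffer.
    exploit = reward + gamma * np.max(q_next_buffer)
    # Exploration: bootstrap from the average value under the current policy.
    explore = reward + gamma * np.mean(q_next_policy)
    return lam * exploit + (1.0 - lam) * explore

target = blended_target(
    reward=1.0,
    gamma=0.9,
    q_next_buffer=np.array([2.0, 5.0]),
    q_next_policy=np.array([1.0, 3.0]),
    lam=0.5,
)
print(target)
```

With `lam=0` the target reduces to a standard policy-based backup, and with `lam=1` it bootstraps only from the best buffered action, so in this sketch `lam` controls how strongly past successes are exploited.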