Learning high-quality Q-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous work has focused on the value overestimation issue, a consequence of combining function approximators with off-policy learning. Departing from this common viewpoint, we observe that Q-values are in fact underestimated in the later stages of RL training. This underestimation is primarily related to Bellman updates that use inferior actions from the current policy rather than the better action samples already stored in the replay buffer. We hypothesize that this long-neglected phenomenon hinders policy learning and reduces sample efficiency. Our insight for addressing it is to exploit past successes sufficiently while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates Q-values using both historical best-performing actions and the current policy. Instantiations of our method in both model-free and model-based settings outperform state-of-the-art methods on various continuous control tasks and achieve strong performance in failure-prone scenarios and real-world robot tasks.
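To make the idea of blending the two update sources concrete, the following is a minimal tabular sketch, not the operator as defined in the paper. The abstract gives no formula, so everything here is an assumption for illustration: the function name `blended_target`, the mixing weight `lam`, the dictionary `buffer_actions` that maps a state to the actions stored for it in the replay buffer, and the use of a plain expectation under the current policy for the exploration term.

```python
import numpy as np


def blended_target(q, reward, next_state, policy_probs, buffer_actions,
                   gamma=0.99, lam=0.5):
    """Return a one-step Bellman target that mixes two bootstrapped values.

    q              : array [n_states, n_actions], current Q-value estimates
    policy_probs   : array [n_states, n_actions], current policy pi(a|s)
    buffer_actions : dict mapping state -> list of actions seen in the buffer
    lam            : hypothetical mixing weight in [0, 1] (an assumption)
    """
    next_q = q[next_state]

    # Exploration-style value: expected Q under the current policy.
    explore_value = float(policy_probs[next_state] @ next_q)

    # Exploitation-style value: best Q among actions actually stored in the
    # replay buffer for this state. Fall back to the policy value if the
    # buffer holds no action for the state.
    seen = buffer_actions.get(next_state, [])
    exploit_value = max(float(next_q[a]) for a in seen) if seen else explore_value

    blended = lam * exploit_value + (1.0 - lam) * explore_value
    return reward + gamma * blended


# Toy example: in state 1 the buffer holds action 1, which has a higher
# Q-value than the current policy's average choice.
q = np.array([[0.0, 0.0], [1.0, 3.0]])
policy_probs = np.array([[0.5, 0.5], [0.5, 0.5]])
buffer_actions = {1: [1]}

policy_only = blended_target(q, 1.0, 1, policy_probs, buffer_actions, gamma=0.5, lam=0.0)
mixed = blended_target(q, 1.0, 1, policy_probs, buffer_actions, gamma=0.5, lam=0.5)
print(policy_only, mixed)
```

In this toy case the policy-only target is lower than the mixed target, which illustrates the kind of underestimation described above when the buffer contains better actions than the current policy selects. The continuous control setting targeted by the paper would require function approximators rather than a table.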