This paper introduces methods in Reinforcement Learning (RL) that address and exploit estimation biases in Actor-Critic methods for continuous control tasks, using Deep Double Q-Learning. We propose two novel algorithms: Expectile Delayed Deep Deterministic Policy Gradient (ExpD3) and Bias Exploiting - Twin Delayed Deep Deterministic Policy Gradient (BE-TD3). ExpD3 aims to reduce overestimation bias using a single $Q$ estimate, balancing computational efficiency against performance, while BE-TD3 dynamically selects the most advantageous estimation bias during training. Experiments across a range of continuous control tasks show that these algorithms match or surpass existing methods such as TD3, particularly in environments where estimation biases significantly impact learning. The results underline the importance of bias exploitation for improving policy learning in RL.
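To make concrete how an expectile can shift a single $Q$ estimate away from overestimation, the following is a minimal sketch of the standard expectile regression loss. The abstract does not specify the loss used by ExpD3, so this is an assumption for illustration only: the function names `expectile_loss` and `fit_expectile`, the parameter `tau`, and its default value are all hypothetical. With `tau = 0.5` the loss reduces to half the usual mean squared error; with `tau < 0.5`, errors where the prediction exceeds the target are weighted more heavily, which pulls the estimate downward.

```python
def expectile_loss(q_pred, q_target, tau=0.3):
    """Asymmetric squared loss (standard expectile regression).

    Assumption: illustrative only, not taken from the paper.
    Residuals u = target - prediction are weighted by tau when u >= 0
    and by (1 - tau) when u < 0.
    """
    total = 0.0
    for q, y in zip(q_pred, q_target):
        u = y - q
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(q_pred)


def fit_expectile(samples, tau, lr=0.1, steps=2000):
    """Find the scalar that minimizes the expectile loss by gradient descent."""
    q = sum(samples) / len(samples)  # start from the sample mean
    for _ in range(steps):
        grad = 0.0
        for y in samples:
            u = y - q
            weight = tau if u >= 0 else 1.0 - tau
            grad += -2.0 * weight * u  # d/dq of weight * (y - q)^2
        q -= lr * grad / len(samples)
    return q


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(fit_expectile(targets, tau=0.5))  # coincides with the mean
    print(fit_expectile(targets, tau=0.3))  # lies below the mean
```

In this toy setting, lowering `tau` yields a more pessimistic estimate from one value function, which is the kind of effect that the twin-critic minimum in TD3 obtains with two estimates.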