Continuous-control Deep Reinforcement Learning (RL) approaches are known to suffer from estimation biases, leading to suboptimal policies. This paper addresses and exploits estimation biases in Actor-Critic methods for continuous-control tasks, building on Deep Double Q-Learning. We design a Bias Exploiting (BE) mechanism that dynamically selects the most advantageous estimation bias during training of the RL agent. Most state-of-the-art Deep RL algorithms can be equipped with the BE mechanism without degrading performance or increasing computational complexity. Our extensive experiments across various continuous-control tasks demonstrate the effectiveness of our approach. We show that RL algorithms equipped with this mechanism can match or surpass their counterparts, particularly in environments where estimation biases significantly impact learning. The results underline the importance of bias exploitation in improving policy learning in RL.
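To illustrate the idea, here is a minimal sketch of how a bias-switching bootstrap target could look in a twin-critic setting. This is an assumption-laden illustration, not the paper's actual algorithm: the names `be_target` and `bias_mode` are hypothetical, and we assume the mechanism chooses between the pessimistic (min of two critics, as in Clipped Double Q-Learning) and optimistic (max) target during training.

```python
def be_target(reward, q1_next, q2_next, bias_mode, gamma=0.99, done=False):
    """Bootstrap target that exploits either under- or over-estimation bias.

    Hypothetical sketch: `bias_mode` stands in for the BE mechanism's
    dynamic choice between the two estimation biases.
    """
    if bias_mode == "under":
        # Pessimistic target (TD3-style clipped double Q): induces underestimation.
        q_next = min(q1_next, q2_next)
    else:
        # Optimistic target: induces overestimation.
        q_next = max(q1_next, q2_next)
    # Standard one-step TD target; bootstrapping is masked at terminal states.
    return reward + gamma * (1.0 - float(done)) * q_next

# Example: for the same transition, the pessimistic target is never
# larger than the optimistic one.
t_under = be_target(1.0, q1_next=5.0, q2_next=4.0, bias_mode="under")
t_over = be_target(1.0, q1_next=5.0, q2_next=4.0, bias_mode="over")
```

A BE-style mechanism would toggle `bias_mode` during training based on some measure of which bias is currently advantageous, rather than fixing it a priori as TD3 does.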