We address estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new twin TD-regularized actor-critic (TDR) method, which aims to reduce both overestimation and underestimation errors. Combining TDR with established DRL improvements, such as distributional learning and the long N-step surrogate stage reward (LNSS) method, we show that TDR-based actor-critic learning enables DRL methods to outperform their respective baselines in challenging environments in the DeepMind Control Suite. Furthermore, these mechanisms elevate TD3 and SAC to a level of performance comparable to that of D4PG (the current SOTA), and they improve D4PG itself to a new SOTA level as measured by mean reward, convergence speed, learning success rate, and learning variance.