Bootstrapping is behind many of the successes of deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still poorly understood. In this work, we show that they act as an implicit regularizer, which can be beneficial in some cases but also has disadvantages: the regularizer is inflexible and can result in instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is flexible and is a convex regularizer in function space, and we theoretically study its convergence. We conduct an experimental study across a range of environments, discount factors, and degrees of off-policy data collection to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and results in improved performance. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can yield further performance improvements compared to adjusting only the network update period, as is typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.
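To make the contrast between the two mechanisms concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a linear value function over hypothetical feature matrices `phi` and `phi_next`, and it assumes the explicit regularizer takes the form of a squared penalty, weighted by a hypothetical coefficient `kappa`, between the current predictions and those of the lagging parameters. The abstract states only that the regularizer is convex in function space, so this particular form is an illustrative assumption.

```python
import numpy as np


def q_values(theta, features):
    """Linear value estimate: one scalar per row of `features`."""
    return features @ theta


def target_network_grad(theta, theta_lag, phi, phi_next, reward, gamma):
    """Semi-gradient of 0.5 * mean(td_error**2) with a lagging target."""
    # The bootstrap target is computed with the lagging parameters.
    target = reward + gamma * q_values(theta_lag, phi_next)
    td_error = q_values(theta, phi) - target
    return phi.T @ td_error / len(reward)


def functional_reg_grad(theta, theta_lag, phi, phi_next, reward, gamma, kappa):
    """Semi-gradient of the TD(0) loss plus an assumed function-space penalty."""
    # The target uses the current parameters but is held constant.
    target = reward + gamma * q_values(theta, phi_next)
    td_error = q_values(theta, phi) - target
    # Assumed penalty: 0.5 * kappa * mean((Q_theta - Q_theta_lag)**2).
    drift = q_values(theta, phi) - q_values(theta_lag, phi)
    return phi.T @ (td_error + kappa * drift) / len(reward)


# Synthetic batch, purely illustrative.
rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 3))
phi_next = rng.normal(size=(8, 3))
reward = rng.normal(size=8)
theta = rng.normal(size=3)
theta_lag = theta.copy()  # lagging copy freshly synchronized

g_tn = target_network_grad(theta, theta_lag, phi, phi_next, reward, 0.9)
g_fr = functional_reg_grad(theta, theta_lag, phi, phi_next, reward, 0.9, kappa=1.0)
```

In this sketch the two updates coincide immediately after the lagging copy is synchronized and differ once the parameters drift apart. With `kappa=0`, the regularized update reduces to vanilla TD(0). Unlike the Target Network update, the regularized one exposes `kappa` as a second knob in addition to the update period of the lagging parameters.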