Bootstrapping is behind many of the successes of Deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still poorly understood. In this work, we show that they act as an implicit regularizer. This regularizer has disadvantages, such as being inflexible and non-convex. To overcome these issues, we propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned. We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
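To make the contrast concrete, the following minimal sketch compares the two objectives on a single transition with a tabular Q-function. It is an illustration under stated assumptions rather than the method as specified in this work: the quadratic form of the penalty, the discount value, the tabular setup, and all names (`GAMMA`, `kappa`, `q_lagging`, and the two loss functions) are hypothetical. In the first loss the lagging parameters enter through the bootstrapped target; in the second the target uses the current parameters, and the lagging parameters appear only in an explicit penalty whose weight can be tuned.

```python
# Illustrative sketch only: the tabular setup, the names, and the exact form
# of the regularized loss are assumptions, not taken from the abstract.

GAMMA = 0.9  # discount factor (assumed value)


def max_q(q, state):
    """Greedy value of `state` under the Q-table `q` (dict: state -> list of action values)."""
    return max(q[state])


def target_network_loss(q, q_lagging, s, a, r, s_next):
    """Squared TD error whose bootstrapped target uses the lagging parameters."""
    target = r + GAMMA * max_q(q_lagging, s_next)
    return (q[s][a] - target) ** 2


def functional_regularization_loss(q, q_lagging, s, a, r, s_next, kappa=0.5):
    """Squared TD error with an up-to-date target, plus an explicit penalty
    that keeps the current prediction close to the lagging prediction.

    `kappa` is a hypothetical name for the tunable regularization weight.
    In a gradient-based implementation the target would be held constant.
    """
    target = r + GAMMA * max_q(q, s_next)
    td_term = (q[s][a] - target) ** 2
    penalty = kappa * (q[s][a] - q_lagging[s][a]) ** 2
    return td_term + penalty


# Toy Q-tables with two states and two actions each (made-up numbers).
q_online = {0: [1.0, 2.0], 1: [0.5, 3.0]}
q_lagging = {0: [0.8, 1.5], 1: [0.4, 2.0]}

tn_loss = target_network_loss(q_online, q_lagging, s=0, a=1, r=1.0, s_next=1)
fr_loss = functional_regularization_loss(q_online, q_lagging, s=0, a=1, r=1.0, s_next=1)
```

In this sketch the penalty is quadratic in the predicted value `q[s][a]`, so it is convex in function space, and setting `kappa=0` recovers the unregularized bootstrapped loss.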