Integral to recent successes in deep reinforcement learning has been a class of temporal difference (TD) methods that use infrequently updated target values for policy evaluation in a Markov decision process. Yet a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms to answer the question: "Why do target networks stabilise TD learning?" To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient TD algorithms. Using this framework, we are able to uniquely characterise the so-called deadly triad (TD updates combined with function approximation, possibly nonlinear, and off-policy data), which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. In particular, we show that under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed even in the extremely challenging setting of off-policy sampling and nonlinear function approximation.
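To make the mechanics concrete, the following is a minimal sketch of policy evaluation with infrequently updated target values. It is not the paper's algorithm or its setting: it assumes a hypothetical 3-state Markov reward process, tabular values (a special case of linear function approximation), on-policy expected updates, and illustrative names (`P`, `R`, `GAMMA`, `partially_fitted_evaluation`, `bellman_residual`) that do not come from the paper. The parameter `k` is the number of semigradient steps taken between target updates: `k=1` recovers ordinary expected TD(0), and large `k` approaches a fitted method that regresses fully onto the frozen targets.

```python
# Hypothetical 3-state Markov reward process, chosen for illustration only.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
R = [0.0, 0.0, 1.0]
GAMMA = 0.9


def bellman_backup(values):
    """Expected one-step TD target r + gamma * P v for every state."""
    n = len(values)
    return [
        R[s] + GAMMA * sum(P[s][t] * values[t] for t in range(n))
        for s in range(n)
    ]


def partially_fitted_evaluation(k, alpha=0.5, outer_steps=500):
    """Take k semigradient steps per target update (k=1 is expected TD(0))."""
    values = [0.0] * len(R)
    target_values = list(values)
    for _ in range(outer_steps):
        # Targets are computed from the frozen copy and held fixed below.
        targets = bellman_backup(target_values)
        for _ in range(k):
            values = [v + alpha * (y - v) for v, y in zip(values, targets)]
        # Infrequent target update: copy the current values across.
        target_values = list(values)
    return values


def bellman_residual(values):
    """Largest absolute gap between the values and their Bellman backup."""
    return max(abs(y - v) for v, y in zip(values, bellman_backup(values)))


if __name__ == "__main__":
    for k in (1, 5, 20):
        v = partially_fitted_evaluation(k)
        print(k, [round(x, 4) for x in v], bellman_residual(v))
```

In this benign tabular, on-policy case every choice of `k` reaches the same fixed point, so the sketch only shows how the update frequency interpolates between TD and fitted evaluation. It does not reproduce the off-policy, nonlinear regime in which the stabilising effect analysed in the paper matters.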