Integral to recent successes in deep reinforcement learning has been a class of temporal difference (TD) methods that use infrequently updated target values for policy evaluation in a Markov decision process. Yet a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms to answer the question: "why do target networks stabilise TD learning?" To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient temporal difference algorithms. Using this framework, we are able to uniquely characterise the so-called deadly triad, that is, the use of TD updates with (nonlinear) function approximation and off-policy data, which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. Specifically, we show that under mild regularity conditions and a well-tuned target network update frequency, convergence can be guaranteed even in the extremely challenging setting of off-policy sampling and nonlinear function approximation.
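To make the idea of infrequently updated target values concrete, the following is a minimal illustrative sketch, not the formal construction analysed in this work. It assumes a tabular value function, expected (noise-free) updates, and a hypothetical two-state Markov reward process; the function name `partially_fitted_evaluation` and the parameters `k`, `alpha`, `d` and `outer_iters` are assumptions introduced for illustration only. The bootstrapped targets are computed from frozen parameters `w_bar`, which are synchronised with the online parameters `w` only every `k` steps. Under these assumptions, `k = 1` reduces to an ordinary expected TD(0) update, while a large `k` approaches a fitted method that solves each regression against the frozen targets almost exactly.

```python
def partially_fitted_evaluation(P, r, gamma, d, k, alpha, outer_iters):
    """Expected TD updates against frozen target parameters (tabular case).

    P: transition matrix (list of rows), r: expected reward per state,
    d: state weighting used for the updates, k: number of inner steps
    between target updates. All names are illustrative assumptions.
    """
    n = len(r)
    w = [0.0] * n       # online parameters: one value per state
    w_bar = list(w)     # frozen target parameters
    for _ in range(outer_iters):
        for _ in range(k):
            # Bootstrapped targets use w_bar, not the online parameters w.
            targets = [
                r[s] + gamma * sum(P[s][t] * w_bar[t] for t in range(n))
                for s in range(n)
            ]
            # Step towards the frozen targets, weighted by d.
            w = [w[s] + alpha * d[s] * (targets[s] - w[s]) for s in range(n)]
        w_bar = list(w)  # infrequent target update
    return w


# Hypothetical two-state example: uniform transitions, reward 1 in state 0.
P = [[0.5, 0.5], [0.5, 0.5]]
r = [1.0, 0.0]
values = partially_fitted_evaluation(
    P, r, gamma=0.9, d=[0.5, 0.5], k=10, alpha=0.5, outer_iters=200
)
print(values)
```

In this toy problem the true values solve V = r + 0.9 P V, which gives V = (5.5, 4.5), and the sketch converges to that solution. The example is deliberately benign (tabular, with well-behaved updates), so it shows only the mechanics of the target update schedule and does not reproduce the off-policy, nonlinear setting in which the convergence guarantee is of interest.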