The target network update frequency (TUF) is a central stabilization mechanism in (deep) Q-learning. However, its selection remains poorly understood and is often treated as just another tunable hyperparameter rather than as a principled design decision. This work provides a theoretical analysis of target fixing in tabular Q-learning through the lens of approximate dynamic programming. We formulate periodic target updates as a nested optimization scheme in which each outer iteration applies an inexact Bellman optimality operator, approximated by a generic inner-loop optimizer. This formulation yields a finite-time convergence analysis for the asynchronous sampling setting, specialized to stochastic gradient descent in the inner loop. Our results give an explicit characterization of the bias-variance trade-off induced by the target update period, showing how to set this critical hyperparameter optimally. We prove that constant target update schedules are suboptimal, incurring a logarithmic overhead in sample complexity that adaptive schedules avoid entirely. Our analysis shows that the optimal target update frequency increases geometrically over the course of learning.
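Read as approximate dynamic programming, the nested scheme admits a compact statement. The notation below (\(Q_k\) for the frozen target of the \(k\)-th outer iteration, \(\mathcal{T}\) for the Bellman optimality operator) is illustrative and not taken verbatim from the analysis:
\[
Q_{k+1} \;\approx\; \arg\min_{Q}\ \mathbb{E}_{(s,a)}\Big[\big(Q(s,a) - (\mathcal{T}Q_k)(s,a)\big)^{2}\Big],
\qquad
(\mathcal{T}Q)(s,a) \;=\; \mathbb{E}_{r,\,s'}\Big[r + \gamma \max_{a'} Q(s',a')\Big].
\]
Each outer step thus applies \(\mathcal{T}\) only inexactly: the inner loop (here, SGD on sampled transitions) solves the regression toward the frozen backup \(\mathcal{T}Q_k\) up to an error that shrinks with the number of inner steps, i.e., with the target update period. This is one concrete reading of the bias-variance trade-off the abstract refers to.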
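A minimal sketch of one possible instantiation in Python follows. The environment interface (`n_states`, `n_actions`, and a generative `step(s, a)` sampler) and all parameter values are hypothetical placeholders, and the geometric schedule follows the common DQN-style convention in which the "update frequency" hyperparameter counts the inner steps between target refreshes; this is not the paper's implementation.

```python
import numpy as np

def q_learning_fixed_target(env, gamma=0.99, lr=0.1, outer_iters=50,
                            initial_period=16, growth=2.0, seed=0):
    """Tabular Q-learning with periodic target fixing (illustrative sketch).

    The outer loop freezes a target table Q_target; the inner loop runs
    SGD-style updates toward the Bellman backup of that frozen target.
    The inner-loop budget (the target update period) grows geometrically
    across outer iterations, mirroring the adaptive schedule the abstract
    advocates. All constants here are placeholders, not tuned values.
    """
    rng = np.random.default_rng(seed)
    n_s, n_a = env.n_states, env.n_actions  # assumed attributes of `env`
    Q = np.zeros((n_s, n_a))
    period = float(initial_period)

    for _ in range(outer_iters):
        Q_target = Q.copy()  # freeze the target for this outer iteration
        for _ in range(int(period)):
            # Asynchronous sampling: one (s, a, r, s') transition per step.
            s = rng.integers(n_s)
            a = rng.integers(n_a)
            r, s_next = env.step(s, a)  # assumed generative-model sampler
            # One SGD step on the regression toward the frozen backup.
            backup = r + gamma * Q_target[s_next].max()
            Q[s, a] += lr * (backup - Q[s, a])
        period *= growth  # geometric growth of the target update period
    return Q
```

Under this reading, a constant schedule would fix `period` across outer iterations, which is exactly the regime the analysis shows to carry an avoidable logarithmic overhead in sample complexity.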