We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
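To make the expected-update case concrete, here is a minimal sketch of temporal difference prediction with a target network and over-parameterized linear features. It is an illustration under stated assumptions, not the paper's algorithm or experimental setup: the toy Markov reward process (`P`, `r`), the random feature matrix `Phi`, and all sizes are hypothetical, and the inner regression toward the frozen target is solved exactly with a pseudoinverse instead of by gradient steps. With more features than states and full-row-rank features, each inner fit interpolates the bootstrapped targets, so the predicted values follow the contraction v ← r + γPv and approach the true value function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 5, 12, 0.9  # over-parameterized: more features than states

# Hypothetical toy problem (not from the paper): random transition matrix
# under the target policy, random expected rewards, random linear features.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)                   # rows sum to 1
r = rng.random(n_states)
Phi = rng.standard_normal((n_states, n_features))   # full row rank with probability 1

# Ground-truth values for comparison: v = (I - gamma * P)^{-1} r
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Assumption: the inner regression is solved in closed form (minimum-norm
# interpolating solution) rather than by gradient descent.
Phi_pinv = np.linalg.pinv(Phi)

theta_target = np.zeros(n_features)
for _ in range(500):
    # Bootstrapped targets computed from the frozen target parameters,
    # taken in expectation over every state.
    targets = r + gamma * P @ Phi @ theta_target
    # Fit the targets exactly. Because the fit interpolates, any positive
    # state weighting (on- or off-policy) yields the same predicted values.
    theta = Phi_pinv @ targets
    # Target-network update: copy the fitted parameters.
    theta_target = theta.copy()

max_error = np.max(np.abs(Phi @ theta - v_true))
print(f"max value error: {max_error:.2e}")
```

The sketch covers only the case of expected updates over the entire state space. It does not reproduce the batch-of-trajectories or truncated-trajectory settings, nor the high-probability error bounds.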