The average reward is a fundamental performance metric in reinforcement learning (RL) that focuses on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average-reward RL because they provide an efficient online method for learning the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require learning rates governed by a local clock tied to state-visit counts, a device that practitioners do not use and that does not extend beyond the tabular setting. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ under standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
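For concreteness, the following is a minimal sketch of the tabular on-policy one-step differential TD update this analysis concerns, using a single global diminishing step size rather than per-state visit counts. The environment and policy interfaces (`env.reset`, `env.step`, `policy`) and the particular step-size schedule are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def differential_td0(env, policy, num_steps, eta=1.0):
    """Sketch of tabular on-policy differential TD(0).

    Maintains a table V of differential value estimates and a scalar
    average-reward estimate r_bar. Both are updated from the same
    differential TD error. The step size alpha_t = 1 / (t + 1)**0.75
    is a standard diminishing schedule driven by the global time step t,
    with no per-state "local clock" of visit counts.
    """
    V = np.zeros(env.num_states)   # differential value estimates
    r_bar = 0.0                    # running average-reward estimate
    s = env.reset()
    for t in range(num_steps):
        a = policy(s)
        s_next, r = env.step(a)    # hypothetical gym-like interface
        alpha = 1.0 / (t + 1) ** 0.75   # global diminishing learning rate
        # Differential TD error: reward minus the average-reward estimate,
        # plus the usual one-step bootstrap term.
        delta = r - r_bar + V[s_next] - V[s]
        V[s] += alpha * delta
        r_bar += eta * alpha * delta    # r_bar learns from the same error
        s = s_next
    return V, r_bar
```

The $n$-step variant analyzed in the paper replaces the one-step bootstrap with an $n$-step return of differential rewards; the sketch above shows only the $n=1$ case.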