Tabular average reward temporal difference (TD) learning is perhaps the simplest and most fundamental policy evaluation algorithm in average reward reinforcement learning. More than 25 years after its discovery, we are finally able to provide a long-awaited almost sure convergence analysis. Namely, we are the first to prove that, under very mild conditions, tabular average reward TD converges almost surely to a sample-path-dependent fixed point. Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise, built on recent advances in stochastic Krasnoselskii-Mann iterations.
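For concreteness, the following is a minimal sketch of the tabular average reward TD(0) recursion on a synthetic Markov chain. It assumes the common differential form with a learned reward-rate estimate; the step-size schedule and the randomly generated chain are illustrative choices, not the paper's exact setting.

```python
import numpy as np

# Illustrative assumptions: a randomly generated 5-state Markov chain
# standing in for an MDP under a fixed policy, and the common
# "differential" TD(0) update; neither is taken from the paper.
rng = np.random.default_rng(0)
n = 5
P = rng.dirichlet(np.ones(n), size=n)   # row-stochastic transition matrix
r = rng.standard_normal(n)              # expected reward in each state

v = np.zeros(n)   # differential (relative) value estimates
r_bar = 0.0       # running estimate of the average reward

s = 0
for t in range(200_000):
    alpha = (t + 1) ** -0.6              # diminishing step size, as a.s. theory requires
    s_next = rng.choice(n, p=P[s])       # Markovian sampling along one trajectory
    delta = r[s] - r_bar + v[s_next] - v[s]  # differential TD error
    v[s] += alpha * delta                # value update
    r_bar += 0.1 * alpha * delta         # slower reward-rate update (illustrative)
    s = s_next

# Sanity check: the true average reward is the stationary-weighted mean reward.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()
print(f"estimated average reward: {r_bar:.4f}")
print(f"true average reward:      {pi @ r:.4f}")
```

Note that differential values are only identified up to an additive constant, so the associated Bellman operator is nonexpansive rather than contractive and has a continuum of fixed points; which element the iterates settle on depends on the noise realization, consistent with the sample-path-dependent limit described above.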