SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: it does not diverge but oscillates within a bounded region. However, little is known about how fast SARSA converges to that region or how large the region is. In this paper, we make progress on this open problem by establishing the convergence rate of projected SARSA to a bounded region. Importantly, this region is much smaller than the region that we project onto, provided that the magnitude of the reward is not too large. Existing works on the convergence of linear SARSA to a fixed point all require the Lipschitz constant of SARSA's policy improvement operator to be sufficiently small; our analysis instead applies to arbitrary Lipschitz constants and thus characterizes the behavior of linear SARSA in a new regime.
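The abstract does not spell out the update rule, so the following is only a minimal sketch of what projected SARSA with linear function approximation might look like. It assumes a softmax policy over the linear action values (one common Lipschitz choice of policy improvement operator), projection onto a Euclidean ball, and a toy two-state environment. The environment, the feature map, and all function and parameter names are hypothetical illustrations, not the paper's setup.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_l2_ball(w, radius):
    """Project w onto the Euclidean ball of the given radius (assumed projection set)."""
    norm = math.sqrt(dot(w, w))
    if norm <= radius:
        return list(w)
    return [radius * x / norm for x in w]


def softmax_policy(w, features, state, n_actions, temperature):
    """Action probabilities from a softmax over linear action values (assumed policy)."""
    q = [dot(w, features(state, a)) / temperature for a in range(n_actions)]
    m = max(q)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in q]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def projected_sarsa(step, features, dim, n_actions, radius,
                    alpha, gamma, temperature, n_steps, seed=0):
    """Run projected linear SARSA and return the final weights and the weight norms."""
    rng = random.Random(seed)
    w = [0.0] * dim
    state = 0
    action = sample_action(
        softmax_policy(w, features, state, n_actions, temperature), rng)
    norms = []
    for _ in range(n_steps):
        reward, next_state = step(state, action, rng)
        # On-policy: the next action is drawn from the current policy.
        next_action = sample_action(
            softmax_policy(w, features, next_state, n_actions, temperature), rng)
        phi = features(state, action)
        td_error = (reward
                    + gamma * dot(w, features(next_state, next_action))
                    - dot(w, phi))
        # Semi-gradient SARSA step followed by projection onto the ball.
        w = project_l2_ball(
            [wi + alpha * td_error * pi for wi, pi in zip(w, phi)], radius)
        norms.append(math.sqrt(dot(w, w)))
        state, action = next_state, next_action
    return w, norms


def toy_features(state, action):
    """Hypothetical 2-dimensional features with norm at most 1."""
    return [1.0, 0.0] if action == 0 else [0.6 * state, 0.8]


def toy_step(state, action, rng):
    """Hypothetical 2-state environment with a small reward magnitude."""
    next_state = action if rng.random() < 0.9 else 1 - action
    reward = 0.1 if next_state == 1 else 0.0
    return reward, next_state


w, norms = projected_sarsa(toy_step, toy_features, dim=2, n_actions=2,
                           radius=10.0, alpha=0.05, gamma=0.9,
                           temperature=0.5, n_steps=2000)
```

Recording `norms` makes it possible to compare the region the iterates actually occupy with the projection radius, which is the kind of gap the abstract refers to. This toy run is illustrative only and is not evidence for the paper's bounds.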