The reinforcement learning algorithm SARSA, combined with linear function approximation, is known to converge for infinite-horizon discounted Markov decision problems (MDPs). In this paper, we investigate its convergence for random-horizon MDPs, which has not previously been established. We show, in analogy with earlier results for infinite-horizon discounted MDPs, that if the behaviour policy is $\varepsilon$-soft and Lipschitz continuous with respect to the weight vector of the linear function approximation, with a sufficiently small Lipschitz constant, then the algorithm converges with probability one for random-horizon MDPs.
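To make the setting concrete, the following is a minimal sketch of SARSA with linear function approximation on a toy problem, not the construction analysed in the paper. Everything specific in it is an assumption made for illustration: the three-state dynamics in `step`, the feature map `features`, the step-size schedule, and the choice of a geometric horizon (each step ends the episode with probability `stop_prob`) as one simple instance of a random horizon. The behaviour policy mixes a uniform distribution with a softmax over the approximate action values, so every action has probability at least `eps / N_ACTIONS` ($\varepsilon$-soft), and its Lipschitz constant in the weight vector scales roughly with `(1 - eps) / temp`, so a large temperature corresponds to a small constant.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # toy problem sizes (assumed, not from the paper)


def features(s, a):
    """Hypothetical bounded feature map phi(s, a) of dimension 4."""
    x = s / (N_STATES - 1)
    return [1.0, x, float(a), x * a]


def q_value(w, s, a):
    """Linear approximation q(s, a) = w . phi(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))


def policy(w, s, eps=0.1, temp=5.0):
    """Epsilon-soft policy: uniform with weight eps, softmax(q / temp) otherwise."""
    prefs = [q_value(w, s, a) / temp for a in range(N_ACTIONS)]
    m = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [eps / N_ACTIONS + (1.0 - eps) * e / z for e in exps]


def step(s, a, rng):
    """Toy dynamics (assumed): random shift of the state, reward 1 on reaching state 0."""
    s_next = (s + a + rng.randrange(2)) % N_STATES
    reward = 1.0 if s_next == 0 else 0.0
    return s_next, reward


def sarsa(n_episodes=500, stop_prob=0.2, eps=0.1, temp=5.0, seed=0):
    """Undiscounted SARSA over episodes whose length is geometrically distributed."""
    rng = random.Random(seed)
    w = [0.0] * len(features(0, 0))
    t = 0
    for _ in range(n_episodes):
        s = rng.randrange(N_STATES)
        a = rng.choices(range(N_ACTIONS), weights=policy(w, s, eps, temp))[0]
        while True:
            s_next, r = step(s, a, rng)
            done = rng.random() < stop_prob  # the random horizon is reached
            if done:
                target = r  # no bootstrapping past the horizon
            else:
                a_next = rng.choices(
                    range(N_ACTIONS), weights=policy(w, s_next, eps, temp)
                )[0]
                target = r + q_value(w, s_next, a_next)
            t += 1
            alpha = 10.0 / (100.0 + t)  # decreasing step size (assumed schedule)
            delta = target - q_value(w, s, a)
            w = [wi + alpha * delta * fi for wi, fi in zip(w, features(s, a))]
            if done:
                break
            s, a = s_next, a_next
    return w
```

Calling `sarsa()` returns the learned weight vector, and `policy(w, s)` gives the action probabilities in state `s` under those weights. Lowering `temp` sharpens the policy and increases its sensitivity to the weights, which is the regime in which a convergence condition of the kind stated above would be expected to fail.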