This paper develops a unified framework for studying finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We first reformulate the RL algorithms as \textit{Markovian Stochastic Approximation} (SA) algorithms for solving fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., the efficiency of bootstrapping in RL, which was first posed as an open problem in (Sutton, 1999).
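To make the reformulation concrete, the following is a minimal sketch of tabular asynchronous $Q$-learning written as a Markovian SA iteration: along a single trajectory, only the visited state-action component moves toward a noisy evaluation of the Bellman operator, whose fixed point is $Q^*$. The function names `async_q_learning` and `q_star`, the two-state toy MDP, the uniform behavior policy, and the constant step size are all assumptions made for illustration; none of them is taken from the paper, and the sketch does not reproduce its analysis.

```python
import random

def async_q_learning(P, R, gamma=0.9, steps=20000, step_size=0.1, seed=0):
    """Tabular asynchronous Q-learning along one trajectory (illustrative sketch).

    P[s][a][t] is the probability of moving from s to t under action a.
    R[s][a] is the reward for taking action a in state s.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Assumption: uniform random behavior policy, so every pair is visited.
        a = rng.randrange(n_actions)
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        # Noisy Bellman operator evaluated at the Markovian sample (s, a, s_next).
        target = R[s][a] + gamma * max(Q[s_next])
        # SA step: only the visited component (s, a) is updated (asynchronous).
        Q[s][a] += step_size * (target - Q[s][a])
        s = s_next
    return Q

def q_star(P, R, gamma=0.9, iters=2000):
    """Reference fixed point of the Bellman optimality operator via value iteration."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * max(Q[t]) for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
    return Q

# Hypothetical toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

Q_hat = async_q_learning(P, R)
Q_ref = q_star(P, R)
max_error = max(abs(Q_hat[s][a] - Q_ref[s][a]) for s in range(2) for a in range(2))
```

Because the transitions in this toy MDP are deterministic, the constant-step-size iterates contract to the fixed point. With stochastic transitions, a constant step size would generally leave a residual mean-square error, which is the kind of quantity the finite-sample bounds are meant to control.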