Convergence of Q-learning has been the focus of extensive research over the past several decades. Recently, an asymptotic convergence analysis for Q-learning was introduced using a switching system framework. This line of work applies the so-called ordinary differential equation (ODE) approach to prove the convergence of asynchronous Q-learning modeled as a continuous-time switching system, where notions from switching system theory are used to establish asymptotic stability without explicit Lyapunov arguments. However, to prove stability, restrictive conditions, such as quasi-monotonicity, must be satisfied for the underlying switching systems, which makes it difficult to generalize the analysis to other reinforcement learning algorithms, such as smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and can analyze Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning that uses a $p$-norm as a Lyapunov function. However, the proposed analysis addresses more general ODE models that cover both asynchronous Q-learning and its smooth versions within a simpler framework.
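For context, the following is a minimal sketch of the kind of switching-system ODE model referred to above; the symbols $D$, $R$, $P$, $\gamma$, and $\Pi_{Q}$ are introduced here for illustration and are assumptions about the notation rather than the paper's own. Asynchronous Q-learning is commonly associated with the continuous-time model
\[
\dot{Q}_t = D\big(R + \gamma P \Pi_{Q_t} Q_t - Q_t\big),
\]
where $D$ is a diagonal matrix of state-action visitation frequencies, $R$ is the expected reward vector, $P$ is the state transition matrix, $\gamma \in (0,1)$ is the discount factor, and $\Pi_{Q}$ is the greedy-policy selection matrix induced by $Q$. Since $\Pi_{Q_t}$ depends on the current iterate, the dynamics switch among finitely many affine subsystems $\dot{Q}_t = A_{\pi} Q_t + b_{\pi}$, one for each deterministic greedy policy $\pi$, which is the structure the switching-system viewpoint exploits.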