Unified ODE Analysis of Smooth Q-Learning Algorithms

The convergence of Q-learning has been studied extensively over the past several decades. Recently, an asymptotic convergence analysis of Q-learning was introduced using a switching system framework. This approach applies the ordinary differential equation (ODE) method to prove the convergence of asynchronous Q-learning, modeled as a continuous-time switching system, and uses notions from switching system theory to prove asymptotic stability without explicit Lyapunov arguments. However, proving stability requires the underlying switching system to satisfy restrictive conditions, such as quasi-monotonicity, which makes the analysis hard to generalize to other reinforcement learning algorithms, such as smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and covers both Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning that uses the $p$-norm as a Lyapunov function. Unlike that work, however, it addresses more general ODE models that cover both asynchronous Q-learning and its smooth versions within a simpler framework.
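
To make the distinction between asynchronous Q-learning and a smooth variant concrete, the following is a minimal tabular sketch, not the paper's formulation. It assumes a log-sum-exp operator as the smooth replacement for the max; the abstract does not say which smooth operators are analyzed, so this choice, along with the names `lse_max` and `q_update` and the parameters `alpha`, `gamma`, and `beta`, is hypothetical and chosen only for illustration.

```python
import numpy as np


def lse_max(q_row, beta=10.0):
    """Log-sum-exp smooth approximation of max(q_row).

    Assumption: beta is a hypothetical temperature; larger beta gives a
    value closer to the true max. The max is subtracted before
    exponentiating for numerical stability.
    """
    z = beta * np.asarray(q_row, dtype=float)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, smooth=False, beta=10.0):
    """One asynchronous tabular update: only the entry (s, a) changes.

    With smooth=False the target uses the usual max over next-state
    actions; with smooth=True it uses the log-sum-exp operator instead.
    """
    Q = Q.copy()
    next_value = lse_max(Q[s_next], beta) if smooth else Q[s_next].max()
    Q[s, a] += alpha * (r + gamma * next_value - Q[s, a])
    return Q


# Usage: a 2-state, 2-action table initialised to zero.
Q0 = np.zeros((2, 2))
Q_std = q_update(Q0, s=0, a=1, r=1.0, s_next=1)
Q_smooth = q_update(Q0, s=0, a=1, r=1.0, s_next=1, smooth=True)
```

In this sketch the log-sum-exp value lies between the true max and the max plus `log(n) / beta` for `n` actions, so the smooth target slightly overestimates the standard one and the gap shrinks as `beta` grows.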
