Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. This paper serves in part as a tutorial on stochastic approximation and Q-learning, providing details of the INFORMS APS inaugural Applied Probability Trust Plenary Lecture, presented in Nancy, France, in June 2023. The paper also presents new approaches to ensure stability and potentially accelerate convergence, both for these algorithms and for stochastic approximation in other settings. Two contributions are entirely new:
1. Stability of Q-learning with linear function approximation has been an open research topic for over three decades. It is shown that with appropriate optimistic training in the form of a modified Gibbs policy, there exists a solution to the projected Bellman equation, and the algorithm is stable (in the sense of bounded parameter estimates). Convergence remains one of many open topics for research.
2. The new Zap Zero algorithm is designed to approximate the Newton-Raphson flow without matrix inversion. It is stable and convergent under mild assumptions on the mean flow vector field for the algorithm, together with compatible statistical assumptions on an underlying Markov chain. The algorithm is a general approach to stochastic approximation that applies, in particular, to Q-learning with "oblivious" training, even with non-linear function approximation.
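For readers new to the setting, the following is a minimal sketch of Q-learning with linear function approximation, Q(x, u) = theta . psi(x, u), trained under a Gibbs (softmax) policy. It is an illustration only, not the algorithm analyzed in the paper: the plain softmax stands in for the paper's "modified Gibbs policy", whose exact form is not given in the abstract, and the two-state MDP, the reward-maximization convention, the step-size rule, and all function and variable names are assumptions made for this example. With one-hot features the scheme reduces to tabular Q-learning, for which the estimates provably stay in [0, 1/(1 - gamma)] when rewards lie in [0, 1] and step sizes lie in (0, 1].

```python
import numpy as np

def gibbs_policy(q_values, temperature):
    """Softmax over action values (a plain Gibbs policy, assumed here)."""
    z = (q_values - q_values.max()) / temperature  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_learning_linear(P, R, features, gamma=0.5, temperature=0.5,
                      n_steps=20000, seed=0):
    """Q-learning with Q(x, u) = theta . features[x, u] on a finite MDP.

    P[x, u] is the next-state distribution, R[x, u] the reward (hypothetical
    layout chosen for this sketch).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, d = features.shape
    theta = np.zeros(d)
    x = 0
    for n in range(1, n_steps + 1):
        q_x = features[x] @ theta                      # Q(x, .) under current theta
        u = rng.choice(n_actions, p=gibbs_policy(q_x, temperature))
        x_next = rng.choice(n_states, p=P[x, u])
        # Temporal-difference term with the max over next actions (Watkins' rule).
        td = R[x, u] + gamma * np.max(features[x_next] @ theta) - q_x[u]
        alpha = 1.0 / n ** 0.7                         # assumed vanishing step size
        theta += alpha * td * features[x, u]           # stochastic approximation update
        x = x_next
    return theta

# Toy MDP (assumed): action 0 stays put, action 1 switches state;
# the only reward is for staying in state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # action 1: switch
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
features = np.eye(4).reshape(2, 2, 4)   # one-hot basis: the tabular special case

theta = q_learning_linear(P, R, features)
q_table = theta.reshape(2, 2)           # q_table[x, u] estimates Q(x, u)
```

The one-hot basis is chosen deliberately: it is the case in which boundedness of the estimates is elementary. With a general basis, boundedness is exactly the stability question that the abstract describes as open for over three decades.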