Actor-critic methods have achieved significant success in many challenging applications. However, their finite-time convergence is still poorly understood in the most practical single-timescale form. Existing analyses of single-timescale actor-critic have been limited, for simplicity, to i.i.d. sampling or the tabular setting. We investigate the more practical online single-timescale actor-critic algorithm on a continuous state space, where the critic uses linear function approximation and is updated with a single Markovian sample per actor step. Previous analyses have been unable to establish convergence in such a challenging scenario. We demonstrate that the online single-timescale actor-critic method provably finds an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. Our novel framework systematically evaluates and controls the error propagation between the actor and critic. It offers a promising approach for analyzing other single-timescale reinforcement learning algorithms as well.
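To make the setting concrete, the following is a minimal sketch of an online single-timescale actor-critic loop with a linear critic, where each actor step consumes one transition from a single Markov trajectory. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the toy environment, the polynomial features, the softmax policy, the discounted TD(0) critic, the projection radius, and all names (`features`, `env_step`, `single_timescale_ac`) are hypothetical choices made here. The single-timescale aspect shows up only in the actor and critic step sizes `alpha` and `beta` being of the same order.

```python
import numpy as np


def features(s):
    # Hypothetical polynomial features for a scalar state in [-1, 1].
    return np.array([1.0, s, s * s])


def policy_probs(theta, s):
    # Softmax policy over two actions with linear logits theta @ phi(s).
    logits = theta @ features(s)
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def env_step(rng, s, a):
    # Toy dynamics (an assumption): action 0 drifts left, action 1 drifts right.
    drift = 0.1 * (2 * a - 1)
    s_next = float(np.clip(s + drift + 0.05 * rng.standard_normal(), -1.0, 1.0))
    reward = -s_next ** 2  # reward is highest near the origin
    return s_next, reward


def single_timescale_ac(num_steps=5000, alpha=0.05, beta=0.05,
                        gamma=0.95, radius=10.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((2, 3))  # actor parameters, one row per action
    w = np.zeros(3)           # linear critic weights
    s = 0.5
    rewards = np.empty(num_steps)
    for t in range(num_steps):
        phi = features(s)
        probs = policy_probs(theta, s)
        a = int(rng.choice(2, p=probs))
        # One Markovian sample: the next state continues the same trajectory.
        s_next, r = env_step(rng, s, a)
        # TD error under the current critic (discounted setting assumed here).
        delta = r + gamma * (w @ features(s_next)) - w @ phi
        # Critic step, followed by projection onto a ball of the given radius.
        w = w + beta * delta * phi
        norm = np.linalg.norm(w)
        if norm > radius:
            w = w * (radius / norm)
        # Actor step with a step size of the same order as the critic's.
        # For a softmax policy, grad log pi(a|s) = (onehot(a) - probs) outer phi.
        grad_log = -np.outer(probs, phi)
        grad_log[a] += phi
        theta = theta + alpha * delta * grad_log
        s = s_next
        rewards[t] = r
    return theta, w, rewards
```

Calling `single_timescale_ac(num_steps=500)` returns the final actor parameters, the critic weights, and the per-step rewards. A two-timescale variant would instead let `alpha / beta` shrink to zero over time, or run many critic updates per actor step.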