Actor-critic algorithms are widely used in reinforcement learning, but are challenging to analyse mathematically because the data samples arrive online and are not i.i.d. The distribution of the data samples changes dynamically as the model is updated, introducing a complex feedback loop between the data distribution and the reinforcement learning algorithm. We prove that, under a time rescaling, the online actor-critic algorithm with tabular parametrization converges to the solution of an ordinary differential equation (ODE) as the number of updates becomes large. The proof first establishes the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the data samples around a dynamic probability measure, which is a function of the evolving actor model, vanish as the number of updates becomes large. Once the ODE limit has been derived, we study its convergence properties using a two time-scale analysis which asymptotically decouples the critic ODE from the actor ODE. We prove that the critic converges to the solution of the Bellman equation and that the actor converges to the optimal policy. We also establish a convergence rate to this global optimum. Our convergence analysis holds under specific choices for the learning rates and exploration rates in the actor-critic algorithm, which could provide guidance for the implementation of actor-critic algorithms in practice.
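To make the setting concrete, the following is a minimal sketch of a generic online actor-critic loop with a tabular critic and a tabular softmax actor. It is an illustration under stated assumptions and not the algorithm analysed in the paper: the function name `online_actor_critic`, the expected-SARSA critic target, the uniform exploration mixture, and the polynomial schedules for the two learning rates and the exploration rate are all hypothetical choices made for this example. The critic step size decays more slowly than the actor step size, which is one common way to obtain two time-scales. Each sample depends on the current policy, which is the feedback loop between data distribution and algorithm described above.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def online_actor_critic(P, R, gamma=0.9, n_steps=20_000, seed=0):
    """Run a tabular online actor-critic on a finite MDP.

    P has shape (n_states, n_actions, n_states) and holds transition
    probabilities; R has shape (n_states, n_actions) and holds rewards.
    Returns the critic table Q and the learned softmax policy.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))      # tabular critic
    theta = np.zeros((n_states, n_actions))  # tabular actor (softmax logits)

    s = 0
    for k in range(n_steps):
        # Hypothetical schedules (assumptions, not taken from the paper).
        critic_lr = (1.0 + k) ** -0.6  # faster time-scale
        actor_lr = (1.0 + k) ** -0.9   # slower time-scale
        eps = (1.0 + k) ** -0.3        # decaying exploration rate

        # Behaviour policy: softmax actor mixed with uniform exploration.
        pi = softmax(theta[s])
        mu = (1.0 - eps) * pi + eps / n_actions

        # One online sample; its distribution depends on the current actor.
        a = rng.choice(n_actions, p=mu)
        s_next = rng.choice(n_states, p=P[s, a])

        # Critic: expected-SARSA temporal-difference update.
        mu_next = (1.0 - eps) * softmax(theta[s_next]) + eps / n_actions
        td_error = R[s, a] + gamma * mu_next @ Q[s_next] - Q[s, a]
        Q[s, a] += critic_lr * td_error

        # Actor: gradient of log mu(a|s) with respect to theta[s].
        onehot = np.zeros(n_actions)
        onehot[a] = 1.0
        grad_log_mu = (1.0 - eps) * pi[a] * (onehot - pi) / mu[a]
        theta[s] += actor_lr * Q[s, a] * grad_log_mu

        s = s_next

    return Q, softmax(theta)


if __name__ == "__main__":
    # Small random MDP, used only for illustration.
    demo_rng = np.random.default_rng(42)
    P_demo = demo_rng.dirichlet(np.ones(4), size=(4, 3))
    R_demo = demo_rng.uniform(0.0, 1.0, size=(4, 3))
    Q_demo, pi_demo = online_actor_critic(P_demo, R_demo)
    print(np.round(pi_demo, 3))
```

With rewards in [0, 1] and step sizes in (0, 1], every critic update in this sketch is a convex combination of bounded quantities, so the entries of `Q` stay in [0, 1/(1 - gamma)]. The schedules that the convergence analysis actually requires are the ones specified in the paper, not the exponents used here.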