Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning, where the critic evaluates the current policy using algorithms such as temporal difference (TD) learning with function approximation, and the actor updates the policy along an approximate gradient direction using information from the critic. This paper provides the \textit{tightest} non-asymptotic convergence bounds for both the AC and natural AC (NAC) algorithms. Specifically, existing studies show that AC converges to an $\epsilon+\varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity of $\mathcal{O}(\epsilon^{-2})$ (up to a log factor), and that NAC converges to an $\epsilon+\varepsilon_{\text{critic}}+\sqrt{\varepsilon_{\text{actor}}}$ neighborhood of the global optimum with the best known sample complexity of $\mathcal{O}(\epsilon^{-3})$, where $\varepsilon_{\text{critic}}$ is the approximation error of the critic and $\varepsilon_{\text{actor}}$ is the approximation error induced by the insufficient expressive power of the parameterized policy class. This paper analyzes the convergence of both the AC and NAC algorithms with compatible function approximation. Our analysis eliminates the term $\varepsilon_{\text{critic}}$ from the error bounds while still achieving the best known sample complexities. Moreover, we focus on the challenging single-loop setting with a single Markovian sample trajectory. Our major technical novelty lies in analyzing the stochastic bias due to the policy-dependent and time-varying compatible function approximation in the critic, and in handling the non-ergodicity of the MDP induced by the single Markovian sample trajectory. Numerical results are provided in the appendix.
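To make the single-loop setting concrete, the following is a minimal sketch of natural actor-critic with compatible function approximation, run on a single Markovian trajectory. The toy MDP, tabular softmax parameterization, and step sizes are illustrative assumptions for exposition only, not the paper's algorithm or experimental setup.

\begin{verbatim}
import numpy as np

# Minimal sketch: single-loop NAC with compatible function approximation
# on a hypothetical toy MDP (all constants below are assumptions).
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = next-state dist.
R = rng.uniform(0.0, 1.0, size=(nS, nA))       # reward table

theta = np.zeros((nS, nA))  # softmax policy parameters
w = np.zeros(nS * nA)       # critic weights on compatible features
v = np.zeros(nS)            # value baseline used to form the TD error

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def score(s, a):
    # Compatible features psi(s, a) = grad_theta log pi(a | s); they are
    # policy-dependent, hence time-varying as theta is updated.
    g = np.zeros((nS, nA))
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g.ravel()

alpha, beta = 0.01, 0.05  # actor / critic step sizes
s = 0                     # single Markovian trajectory: no restarts
for t in range(20000):
    a = rng.choice(nA, p=pi(s))
    s_next = rng.choice(nS, p=P[s, a])
    delta = R[s, a] + gamma * v[s_next] - v[s]  # TD error
    v[s] += beta * delta                        # baseline update
    psi = score(s, a)
    w += beta * (delta - psi @ w) * psi  # regress advantage onto psi
    # With compatible approximation, w estimates the natural gradient
    # direction, so the NAC actor simply steps along w.
    theta += alpha * w.reshape(nS, nA)
    s = s_next
\end{verbatim}

In this sketch the critic's regression targets move with the policy because the compatible features are the score function of the current policy; this is exactly the policy-dependent, time-varying bias that the single-loop analysis must control.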