In recent years, considerable research has focused on asymptotic and non-asymptotic convergence analyses of two-timescale actor-critic algorithms, in which the actor is updated on a slower timescale than the critic. A recent work introduced the critic-actor algorithm, in which the timescales of the actor and the critic are reversed, for the infinite-horizon discounted-cost setting in the look-up table case, and established its asymptotic convergence. In this work, we present the first critic-actor algorithm with function approximation in the long-run average reward setting, together with the first finite-time (non-asymptotic) analysis of such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $\epsilon$, which is better than the sample complexity obtained for actor-critic in a similar setting. We also report numerical experiments on three benchmark settings and observe that the critic-actor algorithm competes well with the actor-critic algorithm.
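To make the timescale reversal concrete, here is a minimal sketch assuming polynomially decaying step sizes. The helper `step_size`, the exponents `ACTOR_EXP` and `CRITIC_EXP`, and the sampled iteration counts are hypothetical and chosen only for illustration; they are not the learning rates derived in the paper. In a critic-actor scheme the critic runs on the slower timescale, so the ratio of the critic step size to the actor step size should shrink toward zero.

```python
def step_size(t: int, exponent: float, scale: float = 1.0) -> float:
    """Polynomially decaying step size: scale / (1 + t) ** exponent."""
    return scale / (1.0 + t) ** exponent


# Hypothetical exponents for illustration only (not the paper's optimal rates).
ACTOR_EXP = 0.6   # actor on the faster timescale: its step size decays more slowly
CRITIC_EXP = 0.9  # critic on the slower timescale: its step size decays more quickly


def critic_to_actor_ratio(t: int) -> float:
    """Ratio of critic to actor step size at iteration t."""
    return step_size(t, CRITIC_EXP) / step_size(t, ACTOR_EXP)


# The ratio equals (1 + t) ** -(CRITIC_EXP - ACTOR_EXP), which decays toward zero.
ratios = [critic_to_actor_ratio(t) for t in (0, 10, 1_000, 100_000)]
```

Swapping the two exponents would give the conventional actor-critic ordering, in which the actor step size is the one that becomes negligible relative to the critic's.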