We revisit the standard formulation of the tabular actor-critic algorithm as a two-time-scale stochastic approximation, with the value function computed on the faster time scale and the policy computed on the slower time scale. This arrangement emulates policy iteration. We begin by observing that reversing the time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two algorithms empirically, both without function approximation and with it (using linear as well as nonlinear function approximators). We observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
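
The time-scale reversal described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy two-state MDP, the softmax policy parameterisation, the TD(0) critic, the step-size exponents 0.6 and 0.9, and the names `softmax` and `run`. It only swaps which update receives the faster step size; the paper's exact update rules may differ.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run(critic_is_fast, steps=50000, gamma=0.9, seed=0):
    """Tabular two-time-scale sketch on a hypothetical 2-state, 2-action MDP.

    critic_is_fast=True  -> actor-critic ordering (value on the faster scale).
    critic_is_fast=False -> critic-actor ordering (policy on the faster scale).
    """
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    V = [0.0] * n_states                                   # tabular value estimates
    theta = [[0.0] * n_actions for _ in range(n_states)]   # policy preferences
    s = 0
    for n in range(steps):
        # Assumed step sizes: both are square-summable but not summable,
        # and slow / fast -> 0 as n grows.
        fast = (n + 10) ** -0.6
        slow = (n + 10) ** -0.9
        critic_step, actor_step = (fast, slow) if critic_is_fast else (slow, fast)

        pi = softmax(theta[s])
        a = 0 if rng.random() < pi[0] else 1
        r = float(a)                      # toy reward: action 1 pays 1, action 0 pays 0
        s_next = rng.randrange(n_states)  # toy dynamics: next state is uniform

        delta = r + gamma * V[s_next] - V[s]   # TD(0) error
        V[s] += critic_step * delta            # critic update
        for b in range(n_actions):             # softmax policy-gradient actor update
            grad = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += actor_step * delta * grad
        s = s_next
    return [softmax(row) for row in theta], V


if __name__ == "__main__":
    for label, flag in (("actor-critic", True), ("critic-actor", False)):
        policy, values = run(critic_is_fast=flag)
        print(label, [round(row[1], 3) for row in policy])
```

In this toy problem action 1 is better in every state, so both orderings should shift probability mass toward it. The only difference between the two runs is which of the two updates is driven by the faster step-size sequence.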