Because the objective functions of reinforcement learning problems are typically highly nonconvex, it is desirable that policy gradient, the most popular algorithm, escape saddle points and arrive at second-order stationary points. Existing results consider only vanilla policy gradient algorithms with unbiased gradient estimators, but practical implementations in the infinite-horizon discounted-reward setting are biased because trajectories are sampled over a finite horizon. Moreover, actor-critic methods, whose second-order convergence has not yet been established, are also biased because the critic only approximates the value function. We provide a novel second-order analysis of biased policy gradient methods, covering both the vanilla gradient estimator computed from Monte Carlo sampling of trajectories and the double-loop actor-critic algorithm, in which the critic improves its approximation of the value function via TD(0) learning in the inner loop. Separately, we also establish the convergence of TD(0) on Markov chains irrespective of the initial state distribution.
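As a rough illustration of the TD(0) update that the critic relies on, the minimal sketch below runs tabular TD(0) along a single trajectory of a small Markov reward process and compares the estimate with the fixed point of the Bellman equation. Everything specific here is an assumption made for the example and is not taken from the paper: the three-state chain `P`, the rewards `R`, the discount `GAMMA`, the per-state step size, the tabular value representation, and the function names `exact_values` and `td0`. It demonstrates the behavior numerically for two starting states; it does not reproduce the paper's analysis.

```python
import random

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
]
R = [1.0, 0.0, -1.0]  # reward collected when leaving each state
GAMMA = 0.9           # discount factor


def exact_values(num_sweeps=2000):
    """Iterate V <- R + GAMMA * P V until it reaches the Bellman fixed point."""
    n = len(R)
    v = [0.0] * n
    for _ in range(num_sweeps):
        v = [R[s] + GAMMA * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


def td0(start_state, num_steps=200_000, seed=0):
    """Tabular TD(0) along one trajectory that begins in start_state."""
    rng = random.Random(seed)
    n = len(R)
    v = [0.0] * n
    visits = [0] * n
    s = start_state
    for _ in range(num_steps):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(range(n), weights=P[s])[0]
        visits[s] += 1
        # Assumed step-size schedule: decays with the visit count of state s.
        alpha = 1.0 / visits[s] ** 0.75
        # TD(0) update: move v[s] toward the one-step bootstrapped target.
        td_error = R[s] + GAMMA * v[s_next] - v[s]
        v[s] += alpha * td_error
        s = s_next
    return v


if __name__ == "__main__":
    print("exact:", [round(x, 3) for x in exact_values()])
    for start in (0, 2):
        print("TD(0) from state", start, [round(x, 3) for x in td0(start)])
```

Running the script prints the exact values next to the TD(0) estimates obtained from two different starting states, which gives a quick way to check that the estimates land near the same fixed point on this particular chain.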