Policy gradient methods are among the most successful approaches for solving challenging reinforcement learning problems. However, despite their empirical success, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that, under tabular parameterizations, the methods remain globally optimal despite the mismatch. We then extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods as well as the gap between their theoretical foundations and practical implementations.
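
As a brief illustration of the mismatch (a sketch in standard notation, not a formula taken from this work): the policy gradient theorem for the discounted objective weights states by the discounted visitation distribution, whereas many practical implementations sample states from the undiscounted on-policy distribution, effectively dropping the $\gamma^t$ weighting.

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta}_{\gamma},\; a \sim \pi_\theta(\cdot \mid s)}\!\big[\, Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big],
\qquad
d^{\pi_\theta}_{\gamma}(s) \;\propto\; \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi_\theta),
\]
while a common practical update instead takes the form
\[
\widehat{\nabla_\theta J}(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\!\big[\, Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big],
\qquad
d^{\pi_\theta}(s) \;=\; \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \Pr(s_t = s \mid \pi_\theta),
\]
so the practical estimator is, in general, a biased estimate of $\nabla_\theta J(\theta)$.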