Policy gradient methods, which search for the policy of interest by maximizing the value functions using first-order information, have become increasingly popular for sequential decision making in reinforcement learning, games, and control. Guaranteeing the global optimality of policy gradient methods, however, is highly nontrivial due to the nonconcavity of the value functions. In this exposition, we highlight recent progress in understanding and developing policy gradient methods with global convergence guarantees, with an emphasis on their finite-time convergence rates with respect to salient problem parameters.