The Proximal Policy Optimization algorithm with a clipped surrogate objective (PPO-Clip) is a prominent example of policy optimization methods. Despite its remarkable empirical success, PPO-Clip has so far lacked theoretical justification. In this paper, we establish the first global convergence results for a PPO-Clip variant in both the tabular and neural function approximation settings. In particular, we show an $O(1/\sqrt{T})$ min-iterate convergence rate in the neural function approximation setting. We address the inherent challenges of analyzing PPO-Clip with three key ideas: (i) We introduce a generalized version of the PPO-Clip objective, motivated by its connection to the hinge loss. (ii) Using entropic mirror descent, we establish asymptotic convergence of tabular PPO-Clip with direct policy parameterization. (iii) Inspired by the tabular analysis, we streamline the convergence analysis with a two-step policy improvement approach, which uses a regression-based update scheme to decouple policy search from the complex neural policy parameterization. Interpreting these generalized objectives also yields deeper insight into why PPO-Clip is effective. Our theoretical findings further provide the first characterization of how the clipping mechanism influences the convergence of PPO-Clip. Notably, the clipping range affects only the pre-constant of the convergence rate.
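To make the hinge-loss connection concrete, the following minimal sketch checks numerically one way to read it. It assumes the standard per-sample PPO-Clip objective $\min(rA, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)A)$, with probability ratio $r$, advantage $A$, and clipping range $\epsilon$, and compares it with the weighted hinge loss $|A|\max(0, \epsilon - \mathrm{sign}(A)(r-1))$. Under these assumptions the two sum to $A + \epsilon|A|$, which does not depend on $r$, so maximizing the clipped objective over the policy is equivalent to minimizing the hinge loss. The function names are illustrative, and the generalized objective analyzed in the paper may be stated differently.

```python
import random


def clipped_surrogate(ratio, advantage, eps):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def hinge_form(ratio, advantage, eps):
    """Weighted hinge loss |A| * max(0, eps - sign(A) * (r - 1)).

    This is an assumed reading of the hinge-loss connection, not
    necessarily the exact form used in the paper.
    """
    sign = (advantage > 0) - (advantage < 0)  # -1, 0, or +1
    return abs(advantage) * max(0.0, eps - sign * (ratio - 1.0))


def check_identity(num_samples=1000, eps=0.2, seed=0):
    """Return the largest deviation from
    clipped_surrogate + hinge_form == A + eps * |A| over random samples."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_samples):
        ratio = rng.uniform(0.0, 3.0)       # probability ratio r
        advantage = rng.uniform(-2.0, 2.0)  # advantage estimate A
        lhs = (clipped_surrogate(ratio, advantage, eps)
               + hinge_form(ratio, advantage, eps))
        rhs = advantage + eps * abs(advantage)
        worst = max(worst, abs(lhs - rhs))
    return worst


if __name__ == "__main__":
    # Expected to be zero up to floating-point rounding.
    print(check_identity())
```

In this reading, the hinge loss is zero exactly when the ratio has moved at least $\epsilon$ in the direction favored by the advantage, which matches the region where the clipped objective is flat in $r$.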