Proximal Policy Optimization (PPO) is among the most widely used deep reinforcement learning algorithms, yet its theoretical foundations remain incomplete. In particular, convergence guarantees and an understanding of PPO's fundamental advantages remain largely open. Under standard theoretical assumptions, we show that PPO's policy update scheme (performing multiple epochs of minibatch updates on reused rollouts with a surrogate gradient) can be interpreted as approximate policy gradient ascent. We show how to control the bias accumulated by the surrogate gradients and use techniques from random reshuffling to prove a convergence theorem for PPO that sheds light on its success. Additionally, we identify a previously overlooked issue in the truncated Generalized Advantage Estimation (GAE) commonly used in PPO: the geometric weighting scheme collapses the entire remaining weight mass onto the longest $k$-step advantage estimator at episode boundaries. Empirical evaluations show that a simple weight correction can yield substantial improvements in environments with a strong terminal signal, such as Lunar Lander.
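The weight-collapse effect can be illustrated with a small sketch. This is not the paper's implementation, only an assumed reading of truncated GAE: the untruncated estimator weights the $k$-step advantage by $(1-\lambda)\lambda^{k-1}$, and truncating at horizon $n$ (e.g. at an episode boundary) places the entire remaining geometric tail, $\lambda^{n-1}$, on the longest estimator.

```python
def truncated_gae_weights(lam: float, n: int) -> list[float]:
    """Weights on the 1..n step advantage estimators under truncated GAE.

    Hypothetical helper for illustration: the first n-1 estimators keep
    their geometric weights (1 - lam) * lam**(k-1); the n-step estimator
    absorbs the remaining tail mass lam**(n-1) so the weights sum to 1.
    """
    w = [(1 - lam) * lam ** (k - 1) for k in range(1, n)]
    w.append(lam ** (n - 1))  # tail mass collapses onto the longest estimator
    return w

# Near an episode boundary the available horizon n is short, so the
# longest estimator dominates: with lam = 0.95 and n = 3 the weights are
# [0.05, 0.0475, 0.9025] -- over 90% of the mass on the 3-step estimator.
print(truncated_gae_weights(0.95, 3))
```

With the usual $\lambda \approx 0.95$, a horizon of only a few steps leaves almost all weight on the final, highest-variance estimator, which is consistent with the correction mattering most in environments with strong terminal signals.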