Proximal Policy Optimization (PPO) is widely used in reinforcement learning due to its strong empirical performance, yet it lacks formal guarantees of policy improvement and convergence. PPO's clipped surrogate objective is motivated by a lower bound obtained from a linearization of the value function in a flat-geometry setting. By leveraging the Fisher-Rao (FR) geometry instead, we derive a tighter surrogate objective and introduce Fisher-Rao PPO (FR-PPO). Our scheme provides strong theoretical guarantees, including monotonic policy improvement. In the direct-parametrization setting, we show that FR-PPO achieves sub-linear convergence with no dependence on the dimension of the action or state space, and for parametrized policies we further obtain sub-linear convergence up to the compatible function approximation error. Finally, although our primary focus is theoretical, we also demonstrate empirically that FR-PPO performs well across a range of standard reinforcement learning tasks.
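For reference, the clipped surrogate objective of standard PPO that the abstract contrasts against is the one of Schulman et al. (2017); the FR-PPO surrogate introduced in this work is a different, tighter objective and is not reproduced here. A minimal recall of the standard objective, with $\hat{A}_t$ an advantage estimate and $\epsilon$ the clipping parameter:
\[
L^{\mathrm{CLIP}}(\theta)
= \hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
\]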