Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
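For readers unfamiliar with the heuristic clipped objective that the abstract contrasts with BRRL, the following is a minimal sketch of the standard PPO clipped surrogate. It is not the BPO loss, which this abstract does not specify. The function name, argument names, and the default clipping range of 0.2 are illustrative assumptions, and plain Python lists stand in for the batched tensors a real implementation would use.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    ratios     -- probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages -- advantage estimates for the same samples
    eps        -- clipping range (0.2 is a common choice, assumed here)
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    total = 0.0
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(r * a, clipped * a)
    return total / len(ratios)


# Hypothetical toy batch: one ratio above the range, one below, one at 1.
example_value = ppo_clip_objective([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
```

In this toy batch the first sample has a positive advantage and a ratio above 1 + eps, so its term is capped at (1 + eps) times the advantage. Pushing the ratio further from 1 earns no additional objective, which is the ratio-bounding behavior the abstract refers to.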