Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the theoretical foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens for interpreting the success of the PPO loss, and connects trust region policy optimization with the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for large language model (LLM) fine-tuning. Empirical evaluations of BPO on MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO on LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO, respectively, in stability and final performance.
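For readers unfamiliar with the "heuristic clipped objective" mentioned above, the following is a minimal sketch of the standard PPO clipped surrogate, written in plain Python. It is not the BPO or GBPO loss, whose exact form the abstract does not state. The function name `ppo_clipped_objective`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math


def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over samples (to be maximized).

    All names and the default clip_eps are illustrative assumptions; this is
    the well-known PPO objective, not the BPO loss from the paper.
    """
    terms = []
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Probability ratio between the new and the old policy for this sample.
        ratio = math.exp(lp_new - lp_old)
        # Clamp the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice: take the smaller of the two surrogate values.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

With a positive advantage, the objective stops rewarding ratio increases beyond `1 + clip_eps`; with a negative advantage, it stops rewarding ratio decreases below `1 - clip_eps`. This is the heuristic bound on the probability ratio that the abstract contrasts with trust region theory.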