Reinforcement learning (RL) has become a cornerstone of fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio-clipping mechanism in PPO is structurally ill-suited to the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a suboptimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which replaces heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a large memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
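The mismatch between the sampled-token ratio and the actual policy divergence can be illustrated with a toy four-token vocabulary. The sketch below is not taken from the released code: the function names, the example distributions, and the specific forms of the binary and top-K estimates (collapsing the vocabulary into "sampled token vs. everything else", or into the K most likely tokens plus one remainder bucket) are assumptions made for illustration. The clip check is also simplified to a symmetric interval and ignores the sign of the advantage.

```python
# Toy illustration (assumed forms, not the authors' implementation).

def total_variation(p, q):
    """Exact TV distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def binary_tv(p, q, token):
    """TV after collapsing the vocabulary to {token, everything else}.
    This equals |q[token] - p[token]| and lower-bounds the full TV."""
    return abs(q[token] - p[token])

def topk_tv(p, q, k):
    """TV over the k most likely tokens under p plus one remainder bucket."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    p_rest = 1.0 - sum(p[i] for i in top)
    q_rest = 1.0 - sum(q[i] for i in top)
    head = sum(abs(p[i] - q[i]) for i in top)
    return 0.5 * (head + abs(p_rest - q_rest))

def ratio_outside_clip(p, q, token, eps=0.2):
    """True if the sampled-token ratio leaves [1 - eps, 1 + eps].
    Simplification: real PPO clipping also depends on the advantage sign."""
    ratio = q[token] / p[token]
    return abs(ratio - 1.0) > eps

# Hypothetical old policy over a 4-token vocabulary.
old = [0.90, 0.07, 0.02, 0.01]
# Case A: a rare token doubles in probability (0.01 -> 0.02).
new_a = [0.89, 0.07, 0.02, 0.02]
# Case B: the dominant token drops sharply (0.90 -> 0.75).
new_b = [0.75, 0.15, 0.06, 0.04]

if __name__ == "__main__":
    for name, new, tok in [("A (rare token)", new_a, 3),
                           ("B (dominant token)", new_b, 0)]:
        print(name,
              "| ratio outside clip range:", ratio_outside_clip(old, new, tok),
              "| full TV:", round(total_variation(old, new), 4),
              "| binary TV:", round(binary_tv(old, new, tok), 4),
              "| top-2 TV:", round(topk_tv(old, new, 2), 4))
```

In case A the ratio for the sampled rare token is 2, well outside a clip range of 0.2, even though the full distribution has barely moved. In case B the ratio stays inside the clip range while the distribution shifts by an order of magnitude more in total variation. Both collapsed estimates need only a handful of probabilities per position rather than the full vocabulary, which is the kind of saving the abstract attributes to the Binary and Top-K approximations.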