Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a prohibitive memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning.
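The contrast between PPO's single-sample ratio and a direct divergence estimate can be illustrated with a minimal sketch. The function names, the choice of Total Variation, and the Top-K truncation rule below are illustrative assumptions based on the abstract, not the paper's actual implementation:

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def single_sample_ratio(p_new, p_old, token):
    """PPO's proxy: the probability ratio of the one sampled token,
    a noisy single-sample Monte Carlo estimate of policy divergence."""
    return p_new[token] / p_old[token]

def topk_tv(p_new, p_old, k):
    """Hypothetical Top-K approximation of Total Variation,
    TV = 0.5 * sum_i |p_i - q_i|, restricted to the k tokens that
    carry the most old-policy mass, so only k probabilities per
    position need to be stored rather than the full vocabulary."""
    idx = sorted(range(len(p_old)), key=lambda i: p_old[i], reverse=True)[:k]
    return 0.5 * sum(abs(p_new[i] - p_old[i]) for i in idx)
```

With k equal to the vocabulary size the approximation recovers the exact Total Variation; smaller k trades a lower-bound estimate for memory savings, which is the trade-off the abstract's "negligible overhead" claim refers to.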