Reinforcement learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved advantage estimators and normalization schemes, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete, differentiable trust region projection that provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across mathematical reasoning and code generation tasks, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in training speed, stability, and final success rate.
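To make the two quantities contrasted above concrete, the following minimal sketch computes a PPO-style clipped surrogate for one sampled token and a token-level KL divergence restricted to a sparse top-k subset of the vocabulary. It is not the TROLL projection, which the abstract does not specify. The function names, the clip range `eps=0.2`, the choice to rank tokens by old-policy probability, and the lumping of all remaining tokens into a single bucket are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_clip_term(p_new, p_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single sampled token.

    eps is an illustrative clip range, not a value taken from the abstract.
    """
    ratio = p_new / p_old
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def sparse_token_kl(old_logits, new_logits, k):
    """KL(old || new) for one token position over a sparse subset.

    Keeps the k tokens most probable under the old policy and lumps the
    remaining probability mass into one bucket (an assumption of this sketch).
    """
    p_old = softmax(old_logits)
    p_new = softmax(new_logits)
    top = sorted(range(len(p_old)), key=lambda i: p_old[i], reverse=True)[:k]
    kl = sum(p_old[i] * math.log(p_old[i] / p_new[i]) for i in top)
    rest_old = 1.0 - sum(p_old[i] for i in top)
    rest_new = 1.0 - sum(p_new[i] for i in top)
    # The remainder bucket is skipped when either mass is negligible.
    if rest_old > 1e-12 and rest_new > 1e-12:
        kl += rest_old * math.log(rest_old / rest_new)
    return kl


if __name__ == "__main__":
    old = [2.0, 1.0, 0.5, -1.0]
    new = [2.5, 0.5, 0.5, -1.5]
    print("clip term:", ppo_clip_term(0.6, 0.4, advantage=1.0))
    print("sparse KL (k=2):", sparse_token_kl(old, new, k=2))
    print("full KL (k=4):", sparse_token_kl(old, new, k=4))
```

The clipped term only caps the probability ratio of the sampled token, whereas the KL value measures how far the distribution at that position has moved. A trust region method bounds the latter directly. Lumping the tail into one bucket yields a lower bound on the full KL, which is one way a sparse subset can trade accuracy for cost.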