Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across mathematical reasoning and code generation tasks, model families, and advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.
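The core idea, a token-level KL trust region enforced on a sparse subset of logits, can be illustrated with a minimal sketch. This is not the paper's actual TROLL projection operator; the top-k selection, the bisection search on a logit-interpolation weight, and all function names here are illustrative assumptions:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def project_topk_kl(old_logits, new_logits, eps, k=8, iters=40):
    """Sketch of a sparse trust-region projection (assumed form, not TROLL itself):
    restrict attention to the k tokens with the largest old-policy logits, then
    interpolate the new logits toward the old ones until the token-level
    constraint KL(projected || old) <= eps holds, via bisection on the mixing
    weight. Interpolation in logit space keeps the operation differentiable
    in the new logits."""
    idx = sorted(range(len(old_logits)), key=lambda i: old_logits[i])[-k:]
    old = [old_logits[i] for i in idx]
    new = [new_logits[i] for i in idx]
    p_old = softmax(old)
    if kl(softmax(new), p_old) <= eps:
        return idx, softmax(new)  # already inside the trust region
    lo, hi = 0.0, 1.0  # alpha=0 reproduces old policy (KL=0), alpha=1 is new
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mix = [mid * n + (1 - mid) * o for n, o in zip(new, old)]
        if kl(softmax(mix), p_old) <= eps:
            lo = mid  # invariant: lo always satisfies the constraint
        else:
            hi = mid
    mix = [lo * n + (1 - lo) * o for n, o in zip(new, old)]
    return idx, softmax(mix)
```

Because the returned mixture is always evaluated at a weight known to satisfy the constraint, the projected distribution is guaranteed to lie inside the KL trust region, in contrast to the clip objective, which only bounds per-token probability ratios.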