I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning large language models (LLMs). Instead of directly maximizing cumulative rewards as PPO and GRPO do, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward-maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains its exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better, with significantly less overfitting during training. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
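To make the stated objective concrete, the following is a minimal sketch of one plausible form of the GIFT loss, assuming GRPO-style group (z-score) normalization and the DPO implicit reward; the notation ($G$, $\beta$, $r_\phi$, $\pi_{\mathrm{ref}}$, $\tilde{r}$) is introduced here purely for illustration and need not match the formulation in the paper body:
\[
\mathcal{L}_{\mathrm{GIFT}}(\theta) \;=\; \frac{1}{G}\sum_{i=1}^{G}\Bigl(\tilde{r}_\theta(x, y_i) - \tilde{r}_\phi(x, y_i)\Bigr)^2,
\qquad
\tilde{r}(x, y_i) \;=\; \frac{r(x, y_i) - \operatorname{mean}_{j}\, r(x, y_j)}{\operatorname{std}_{j}\, r(x, y_j)},
\]
where $y_1,\dots,y_G$ are responses sampled on-policy for prompt $x$, $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the DPO-style implicit reward, and $r_\phi$ is the explicit reward model. Under this reading, the prompt-dependent partition term $\beta \log Z(x)$ that separates the implicit reward from the true reward is constant across the group and is cancelled by the mean subtraction, which is one way the joint normalization described above can eliminate the intractable term.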