Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) and unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between the reward-aligned and reward-unaligned distributions induced by responses with above-average and below-average reward, respectively, and we prove that the expected reward improves after alignment. Empirically, $f$-GRPO improves over GRPO on math-reasoning RLVR tasks, while the hybrid $f$-HAL mitigates reward hacking in on-policy safety alignment when verifiable rewards are unavailable and learned reward models must be used.
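To make the phrase "above- and below-average reward responses" concrete, the following minimal sketch splits one group of sampled responses by the group's mean reward and computes group-standardized advantages in the style commonly associated with GRPO. It is an illustration only: the function name `split_by_group_mean`, the `eps` constant, the choice to leave responses tied with the mean in neither set, and the example rewards are all assumptions, and the actual $f$-GRPO and $f$-HAL losses are not specified in the abstract and are not reproduced here.

```python
from statistics import mean, pstdev


def split_by_group_mean(rewards, eps=1e-8):
    """Split one group of sampled responses by the group's mean reward.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns (advantages, above, below), where `above` and `below` hold the
    indices of responses whose reward is strictly above or strictly below
    the group mean. Responses exactly at the mean land in neither set
    (an assumption made for this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # Group-relative advantage: reward standardized within the group.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    above = [i for i, r in enumerate(rewards) if r > mu]
    below = [i for i, r in enumerate(rewards) if r < mu]
    return advantages, above, below


# Example: binary verifiable rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages, above, below = split_by_group_mean(rewards)
```

In this example the responses at indices 0 and 3 would play the role of samples from the reward-aligned distribution and those at indices 1 and 2 the reward-unaligned one; how an $f$-divergence estimator is then built on top of this split is the subject of the paper itself.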