Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning, and f-Hybrid Alignment Loss (f-HAL), a hybrid on/off policy objectives, for general LLM alignment based on variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
翻译:近期研究表明,偏好对齐目标可视为对齐(被选中)与未对齐(被拒绝)响应分布之间的散度估计器。本工作将这一基于散度的视角扩展至通用对齐场景,例如仅存在环境奖励的可验证奖励强化学习。在此统一框架下,我们基于f-散度的变分表示,提出了适用于通用大语言模型对齐的f-群组相对策略优化(一类在线策略强化学习算法)与f-混合对齐损失(混合在线/离线策略目标)。我们提供了理论保证,证明这些目标函数类能在对齐后提升平均奖励。通过实验,我们在可验证奖励强化学习(数学推理)与偏好对齐任务(安全对齐)上验证了本框架,结果表明其相较于现有方法具有更优的性能与灵活性。