General Preference Reinforcement Learning

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
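
The advantage computation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `per_subspace_scores` and `gprl_advantages`, the interleaved layout of the $2k$-dimensional embedding, the sign convention of the skew-symmetric score, and the normalization of the eigenvalues to sum to one are all assumptions made for illustration. The sketch scores each of $G$ sampled responses against its group within every 2-D subspace, standardizes each dimension separately, and then combines the dimensions with the given weights.

```python
import numpy as np

def per_subspace_scores(emb):
    """Mean skew-symmetric score of each response against its group, per subspace.

    emb: (G, 2k) array; columns (2l, 2l+1) are assumed to form the l-th 2-D subspace.
    Returns a (G, k) array.
    """
    emb = np.asarray(emb, dtype=float)
    a, b = emb[:, 0::2], emb[:, 1::2]  # (G, k) each
    # pairwise[i, j, l] = a[i,l]*b[j,l] - b[i,l]*a[j,l], antisymmetric in (i, j).
    # The sign convention is an assumption of this sketch.
    pairwise = a[:, None, :] * b[None, :, :] - b[:, None, :] * a[None, :, :]
    return pairwise.mean(axis=1)

def gprl_advantages(scores, eigvals, eps=1e-8):
    """Per-dimension group-relative advantages, aggregated with eigenvalue weights.

    scores:  (G, k) per-dimension scores for G responses to one prompt.
    eigvals: (k,) non-negative, context-dependent weights (hypothetical input).
    Returns (aggregated advantages of shape (G,), per-dimension advantages (G, k)).
    """
    scores = np.asarray(scores, dtype=float)
    eigvals = np.asarray(eigvals, dtype=float)
    centered = scores - scores.mean(axis=0, keepdims=True)
    # Each dimension is divided by its own standard deviation, so a dimension
    # with a large raw scale cannot dominate the aggregate.
    per_dim = centered / (scores.std(axis=0, keepdims=True) + eps)
    weights = eigvals / (eigvals.sum() + eps)
    return per_dim @ weights, per_dim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 8))          # G = 6 responses, k = 4 subspaces
    eigvals = np.array([1.0, 2.0, 3.0, 4.0])
    adv, per_dim = gprl_advantages(per_subspace_scores(emb), eigvals)
    print(adv.shape, per_dim.shape)
```

Because every dimension is standardized before aggregation, rescaling the raw scores of a single subspace leaves the aggregated advantages essentially unchanged. That is the property the abstract refers to when it says no axis can dominate.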
