Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work has shifted toward Generative Reward Models (GenRMs), which generate rationales before predicting preferences. Yet GenRM training and evaluation remain outcome-label-only in practice, leaving reasoning quality unchecked. We show that reasoning fidelity, i.e., the consistency between a GenRM's preference decision and reference decision rationales, is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr), the fraction of label-correct decisions whose rationales are misaligned with gold judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
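For concreteness, one way to formalize S-Corr from the description above is the sketch below; the notation and the choice to normalize over label-correct decisions (rather than over all examples) are our assumptions, not statements of the paper's exact definition.

$$
\mathrm{S\text{-}Corr} \;=\; \frac{\big|\{\, i \in \mathcal{D} : \hat{y}_i = y_i \;\wedge\; a_i = 0 \,\}\big|}{\big|\{\, i \in \mathcal{D} : \hat{y}_i = y_i \,\}\big|},
$$

where $\mathcal{D}$ is the repurposed benchmark set, $y_i$ the gold preference label, $\hat{y}_i$ the GenRM's predicted label, and $a_i \in \{0,1\}$ indicates whether the generated rationale aligns with the gold judgment.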