Reinforcement Learning with Verifiable Rewards (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). In typical pairwise rewarding tasks, a GRM generates a reasoning chain that ends with a critique and a preference label, and RLVR uses the correctness of the preference label as the training reward. However, in this paper we show that such binary classification tasks leave GRMs susceptible to guessing the correct outcome without producing sound critiques. These spurious successes introduce substantial noise into the reward signal and thereby impair the effectiveness of reinforcement learning. To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals and thus mitigates the limited solution space inherent in binary tasks. Specifically, we use the similarity between GRM-generated and human critiques as the training reward, which provides a more accurate reward signal than outcome-only supervision. Moreover, because human critiques are difficult to scale, we introduce a Meta Reward Model (MetaRM) that learns to predict process rewards from datasets with human critiques and then generalizes to data without them. Experiments on multiple benchmarks show that our method consistently outperforms state-of-the-art GRMs trained with outcome-only rewards, confirming the superiority of natural language feedback over binary human feedback as supervision.
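As an illustration of the similarity-based process reward described above, the sketch below scores a GRM-generated critique by its cosine similarity to a human critique embedding and optionally mixes it with the outcome reward. The embedding source, the mixing weight `alpha`, and the additive combination rule are assumptions for illustration only; the abstract states only that critique similarity serves as the training reward.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two critique embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def critique_reward(grm_critique_emb: np.ndarray,
                    human_critique_emb: np.ndarray,
                    outcome_correct: bool,
                    alpha: float = 0.5) -> float:
    """Illustrative reward: critique similarity mixed with outcome correctness.

    `alpha` and the additive mixing are hypothetical; the exact reward
    instantiation is not specified in the abstract.
    """
    sim = cosine_similarity(grm_critique_emb, human_critique_emb)
    return alpha * sim + (1.0 - alpha) * float(outcome_correct)


# Toy usage with random vectors standing in for encoded critiques.
rng = np.random.default_rng(0)
r = critique_reward(rng.normal(size=768), rng.normal(size=768), outcome_correct=True)
print(f"reward = {r:.3f}")
```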