Reward modeling is a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling depends heavily on experimental feedback data, which is costly to collect. In this work, we study \textit{implicit reward modeling} -- learning reward models from implicit human feedback (e.g., clicks and copies) -- as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which makes definitive negative samples even harder to identify. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM uses a stratification model to assign training samples to four latent groups. Building on this stratification, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased and which thereby addresses both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.