CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling relies heavily on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models from observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, causing it to deviate from true user preference; (2) observational feedback is biased by user preference, because users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM explicitly models the annotation error generation process and introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
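To make the two ingredients concrete, the following is a minimal sketch and not the CausalRM implementation. It assumes binary feedback labels, class-conditional label noise with known flip rates, and known propensity scores. The noise correction shown is the standard backward-corrected (unbiased) loss for class-conditional noise, swapped in for illustration; the function names, the flip-rate parameters `rho_pos` and `rho_neg`, and the clipping threshold are all hypothetical.

```python
import numpy as np


def bce(score, label):
    """Per-sample binary cross-entropy for raw reward scores and 0/1 labels."""
    # log(1 + exp(-z)), with z = +score for label 1 and -score for label 0.
    z = np.where(label == 1, score, -score)
    return np.logaddexp(0.0, -z)


def noise_corrected_loss(score, noisy_label, rho_pos, rho_neg):
    """Backward-corrected loss under class-conditional label noise.

    Assumed (hypothetical) noise model:
      rho_pos = P(observed 0 | true 1), rho_neg = P(observed 1 | true 0).
    Its expectation over the noisy label equals the clean-label loss.
    """
    # Flip rate of the observed class and of the opposite class.
    rho_same = np.where(noisy_label == 1, rho_pos, rho_neg)
    rho_other = np.where(noisy_label == 1, rho_neg, rho_pos)
    loss_same = bce(score, noisy_label)
    loss_flip = bce(score, 1 - noisy_label)
    return ((1.0 - rho_other) * loss_same - rho_same * loss_flip) / (
        1.0 - rho_pos - rho_neg
    )


def ipw_noise_corrected_loss(score, noisy_label, propensity,
                             rho_pos, rho_neg, clip=0.05):
    """Inverse-propensity-weighted mean of the noise-corrected loss.

    `propensity` is the assumed probability that the user left feedback on
    each response; clipping from below bounds the weights (an illustrative
    choice that trades a little bias for lower variance).
    """
    weights = 1.0 / np.clip(propensity, clip, 1.0)
    per_sample = noise_corrected_loss(score, noisy_label, rho_pos, rho_neg)
    return float(np.mean(weights * per_sample))


if __name__ == "__main__":
    scores = np.array([0.3, -1.2, 2.0])    # reward model outputs
    labels = np.array([1, 0, 1])           # observed (possibly noisy) feedback
    props = np.array([0.9, 0.2, 0.5])      # assumed feedback propensities
    print(ipw_noise_corrected_loss(scores, labels, props, 0.2, 0.1))
```

With both flip rates set to zero the corrected loss reduces to plain cross-entropy, and with all propensities equal to one the weighting reduces to an ordinary mean. In practice the flip rates and propensities would have to be estimated, which this sketch leaves out.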
