Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a Task-Adaptive Rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PaTaRM achieves an 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models. Crucially, it boosts downstream RLHF performance by an average relative improvement of 13.6% on IFEval and InFoBench, validating its effectiveness for policy alignment. Our code is available at https://github.com/JaneEyre0530/PaTaRM.