Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either the generative reward model (GRM) or the discriminative reward model (DRM) paradigm, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack a probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: the Probabilistic Reward Model (PRM). Instead of modeling the reward as a deterministic scalar, our approach treats it as a random variable and learns a full probability distribution over the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose Region Flooding Tuning (RgFT), a data-efficient training strategy that incorporates quality-level annotations to guide the model to concentrate probability mass within the corresponding rating sub-regions, so that rewards better reflect absolute text quality. Experiments on various reward model benchmarks show that our method improves accuracy by $\textbf{2.9\%}\sim\textbf{7.4\%}$ over prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
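As a minimal illustration of the paradigm (the rating count $K$, the head $p_\theta$, and the use of the distribution mean as the scalar reward are our notational assumptions rather than details fixed by this abstract), an OPRM-style model places a categorical distribution over $K$ ordinal ratings for a prompt--response pair $(x, y)$ and can read off a scalar reward as its expectation,
\[
r_\theta(x, y) \;=\; \mathbb{E}_{R \sim p_\theta(\cdot \mid x, y)}[R] \;=\; \sum_{k=1}^{K} k \, p_\theta(R = k \mid x, y),
\]
while an RgFT-style objective would, under the same assumptions, encourage the total mass $\sum_{k \in \mathcal{S}} p_\theta(R = k \mid x, y)$ on the sub-region $\mathcal{S}$ of ratings consistent with a response's annotated quality level to be large.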