The Bradley-Terry (BT) model is a common and successful choice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons into reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling and establish the convergence rate of BT reward models built on deep neural networks over embeddings, providing a theoretical foundation for their use. Despite this theoretical soundness, we argue that the BT model is not a necessary choice from the perspective of downstream optimization: a reward model only needs to preserve correct ranking predictions under a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using $6$ base LLMs, $2$ datasets, and diverse annotation designs that vary in the quantity, quality, and pairing choices of preference annotations.
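For concreteness, the standard BT reward modeling objective can be sketched as follows (the notation here is illustrative rather than taken from the abstract): given a reward model $r_\theta$ and an annotated pair in which response $y_w$ is preferred over $y_l$ for prompt $x$, the BT model posits
$$P(y_w \succ y_l \mid x) = \frac{\exp\left(r_\theta(x, y_w)\right)}{\exp\left(r_\theta(x, y_w)\right) + \exp\left(r_\theta(x, y_l)\right)} = \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right),$$
and $r_\theta$ is trained by minimizing the negative log-likelihood $-\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$, where $\sigma$ denotes the logistic sigmoid. Because $\sigma$ is strictly increasing, any strictly monotonic transformation of $r_\theta$ leaves the induced ranking of responses unchanged, which is the order-consistency property the abstract emphasizes.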