The Bradley-Terry (BT) model is a common and successful choice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons into reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling and establish the convergence rate of BT reward models built on deep neural networks over embeddings, providing a theoretical foundation for their use. Despite this theoretical soundness, we argue that the BT model is not a necessary choice from the perspective of downstream optimization: a reward model only needs to preserve correct ranking predictions under a monotonic transformation of the true reward. We highlight the critical concept of order consistency in reward modeling and demonstrate that the BT model possesses this property. Consequently, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using $6$ base LLMs, $2$ datasets, and diverse annotation designs that vary in the quantity, quality, and pairing choices of preference annotations.
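For concreteness, the standard BT reward modeling objective can be sketched as follows (the notation here is illustrative rather than taken from the abstract): given a reward model $r_\theta$ and an annotated pair in which response $y_w$ is preferred over $y_l$ for prompt $x$, the BT model posits
$$P(y_w \succ y_l \mid x) = \frac{\exp\left(r_\theta(x, y_w)\right)}{\exp\left(r_\theta(x, y_w)\right) + \exp\left(r_\theta(x, y_l)\right)} = \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right),$$
and $r_\theta$ is trained by minimizing the negative log-likelihood $-\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$, where $\sigma$ denotes the logistic sigmoid. Because $\sigma$ is strictly increasing, any strictly monotonic transformation of $r_\theta$ leaves the induced ranking of responses unchanged, which is the order-consistency property the abstract emphasizes.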