Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable, because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To approximate the MLE in practice, we propose a contrastive loss named Energy Preference Alignment (EPA), wherein each positive sample is contrasted against one or more strong negatives as well as many free weak negatives. Theoretical properties of our EBM ensure that the approximation error of EPA almost surely vanishes when a sufficient number of negatives are used. Empirically, we demonstrate that EPA consistently outperforms DPO on open benchmarks, showing the superiority of our EBM.
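The non-uniqueness issue mentioned above can be made concrete with a small numerical sketch (the function and variable names here are illustrative, not from the paper). Under the Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j), so scores are identified only up to an additive constant; moreover, when the preference data is one-sided (e.g., A always beats B), the likelihood keeps improving as the score gap grows, so no finite MLE exists:

```python
import math

def bt_nll(scores, comparisons):
    """Bradley-Terry negative log-likelihood.

    scores: dict mapping item -> real-valued score s_i
    comparisons: list of (winner, loser) pairs
    P(winner beats loser) = sigmoid(s_winner - s_loser)
    """
    nll = 0.0
    for winner, loser in comparisons:
        diff = scores[winner] - scores[loser]
        nll -= math.log(1.0 / (1.0 + math.exp(-diff)))
    return nll

# One-sided data: A beats B in every observed comparison.
one_sided = [("A", "B")] * 3

# 1) Shift invariance: adding a constant to all scores leaves the
#    likelihood unchanged, so the optimum is never a single point.
assert math.isclose(
    bt_nll({"A": 1.0, "B": 0.0}, one_sided),
    bt_nll({"A": 2.0, "B": 1.0}, one_sided),
)

# 2) Divergence: widening the score gap strictly decreases the NLL,
#    so no finite maximum likelihood estimator exists for this data.
nlls = [bt_nll({"A": g, "B": 0.0}, one_sided) for g in (1.0, 5.0, 10.0)]
assert nlls[0] > nlls[1] > nlls[2]
```

Both effects illustrate why the DPO loss, which inherits the Bradley-Terry likelihood, can admit minimizers other than the one satisfying the linearity condition.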