Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable, because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To approximate the MLE in practice, we propose a contrastive loss named Energy Preference Alignment (EPA), wherein each positive sample is contrasted against one or more strong negatives as well as many free weak negatives. Theoretical properties of our EBM ensure that the approximation error of EPA almost surely vanishes when a sufficient number of negatives are used. Empirically, we demonstrate that EPA consistently outperforms DPO on open benchmarks, showing the superiority of our EBM.
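The non-uniqueness issue mentioned above can be made concrete with a small numerical sketch (the function and variable names here are illustrative, not from the paper). Under the Bradley-Terry model, P(i beats j) = sigmoid(s_i - s_j), so scores are identified only up to an additive constant; moreover, when the preference data is one-sided (e.g., A always beats B), the likelihood keeps improving as the score gap grows, so no finite MLE exists:

```python
import math

def bt_nll(scores, comparisons):
    """Bradley-Terry negative log-likelihood.

    scores: dict mapping item -> real-valued score s_i
    comparisons: list of (winner, loser) pairs
    P(winner beats loser) = sigmoid(s_winner - s_loser)
    """
    nll = 0.0
    for winner, loser in comparisons:
        diff = scores[winner] - scores[loser]
        nll -= math.log(1.0 / (1.0 + math.exp(-diff)))
    return nll

# One-sided data: A beats B in every observed comparison.
one_sided = [("A", "B")] * 3

# 1) Shift invariance: adding a constant to all scores leaves the
#    likelihood unchanged, so the optimum is never a single point.
assert math.isclose(
    bt_nll({"A": 1.0, "B": 0.0}, one_sided),
    bt_nll({"A": 2.0, "B": 1.0}, one_sided),
)

# 2) Divergence: widening the score gap strictly decreases the NLL,
#    so no finite maximum likelihood estimator exists for this data.
nlls = [bt_nll({"A": g, "B": 0.0}, one_sided) for g in (1.0, 5.0, 10.0)]
assert nlls[0] > nlls[1] > nlls[2]
```

Both effects illustrate why the DPO loss, which inherits the Bradley-Terry likelihood, can admit minimizers other than the one satisfying the linearity condition.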