User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data, where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment that leverages Noise Contrastive Estimation (NCE) to handle reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both of which enable the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By comparing NCA and InfoNCA, we demonstrate that the widely observed decreasing-likelihood trend of DPO/InfoNCA stems from their focus on adjusting relative likelihoods across different responses. In contrast, NCA optimizes the absolute likelihood of each response, thereby effectively preventing the chosen likelihood from decreasing. We evaluate our methods in both reward and preference settings with Mistral-8*7B and 7B models. Experiments show that InfoNCA/NCA surpass various preference baselines when reward datasets are available. We also find that NCA significantly outperforms DPO in complex reasoning tasks such as math and coding.
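To make the relative-versus-absolute distinction concrete, the sketch below contrasts a softmax-across-responses objective in the spirit of InfoNCA/DPO with a per-response sigmoid objective in the spirit of NCA. This is a minimal illustrative sketch, not the paper's exact losses: the function names, the `alpha`/`beta` hyperparameters, and the tensor shapes are assumptions introduced here for illustration.

```python
# Illustrative sketch only: relative (InfoNCA/DPO-style) vs. absolute
# (NCA-style) contrastive objectives over K responses per prompt.
# `logp_theta`, `logp_ref`, `rewards` are assumed tensors of shape [B, K];
# `alpha` and `beta` are assumed temperature/scaling hyperparameters.
import torch
import torch.nn.functional as F

def relative_infonca_style_loss(logp_theta, logp_ref, rewards, alpha=1.0, beta=0.1):
    # Implicit reward of each response: beta-scaled log-likelihood ratio.
    implicit = beta * (logp_theta - logp_ref)        # [B, K]
    # Soft targets derived from the explicit scalar rewards.
    targets = F.softmax(rewards / alpha, dim=-1)      # [B, K]
    # Relative objective: only likelihoods *across* the K responses matter,
    # so the chosen response's absolute likelihood can still decrease.
    log_probs = F.log_softmax(implicit, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

def absolute_nca_style_loss(logp_theta, logp_ref, rewards, alpha=1.0, beta=0.1):
    implicit = beta * (logp_theta - logp_ref)         # [B, K]
    targets = F.softmax(rewards / alpha, dim=-1)       # [B, K]
    # Absolute objective: each response is pushed up or down on its own via
    # a sigmoid, which discourages the chosen likelihood from decreasing.
    pos = -(targets * F.logsigmoid(implicit)).sum(dim=-1)
    neg = -F.logsigmoid(-implicit).mean(dim=-1)
    return (pos + neg).mean()
```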