User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment, leveraging Noise Contrastive Estimation (NCE) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both enabling the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By contrasting NCA and InfoNCA, we show that InfoNCA and DPO adjust relative likelihood across different responses to a single instruction, while NCA optimizes absolute likelihood for each response. We apply our methods to align a 7B language model with a GPT-4 annotated reward dataset. Experimental results suggest that InfoNCA surpasses the DPO baseline in GPT-4 evaluations, while NCA enjoys better training stability with competitive performance.
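The contrast between relative and absolute likelihood, and the claim that DPO is a special case of InfoNCA, can be illustrated with a minimal sketch. The abstract does not state the loss formulas, so the forms below are assumptions: InfoNCA is written as a cross-entropy between a softmax over the annotated rewards and a softmax over implicit rewards `beta * (log pi - log pi_ref)`, and NCA as a per-response sigmoid objective on the same implicit rewards. The function names and the `alpha` and `beta` temperatures are hypothetical, chosen only for illustration.

```python
import math

def log_softmax(xs):
    # Numerically stable log-softmax over a list of floats.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def implicit_rewards(policy_logps, ref_logps, beta):
    # Assumed implicit reward: scaled log-ratio of policy to reference model.
    return [beta * (p - q) for p, q in zip(policy_logps, ref_logps)]

def infonca_loss(policy_logps, ref_logps, rewards, alpha=1.0, beta=1.0):
    # Assumed form: cross-entropy between softmax(rewards / alpha) and
    # softmax(implicit rewards). Only differences between responses matter.
    targets = softmax([r / alpha for r in rewards])
    log_probs = log_softmax(implicit_rewards(policy_logps, ref_logps, beta))
    return -sum(t * lp for t, lp in zip(targets, log_probs))

def nca_loss(policy_logps, ref_logps, rewards, alpha=1.0, beta=1.0):
    # Assumed form: each response is scored on its own through a sigmoid,
    # so the absolute value of every implicit reward matters.
    targets = softmax([r / alpha for r in rewards])
    z = implicit_rewards(policy_logps, ref_logps, beta)
    k = len(z)
    return -sum(t * log_sigmoid(zi) + log_sigmoid(-zi) / k
                for t, zi in zip(targets, z))

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=1.0):
    # Standard DPO loss for one (chosen, rejected) pair.
    return -log_sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))

if __name__ == "__main__":
    # Illustrative log-probabilities, not taken from any experiment.
    policy = [-12.0, -15.0]
    ref = [-13.0, -14.0]
    # Two responses with a very small alpha make the target one-hot,
    # which is the pairwise preference setting.
    print(infonca_loss(policy, ref, [1.0, 0.0], alpha=1e-3, beta=0.5))
    print(dpo_loss(policy[0], policy[1], ref[0], ref[1], beta=0.5))
```

Under these assumptions, the two printed values agree: with two responses and a one-hot target, the softmax cross-entropy reduces to the negative log-sigmoid of the implicit reward margin. Lowering the log-likelihood of every response by the same constant leaves `infonca_loss` unchanged but changes `nca_loss`, which is one way to read the relative versus absolute distinction drawn above.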