Language model alignment is a cutting-edge technique in large language model training that aligns model outputs with users' intent, e.g., being helpful and harmless. Recent alignment frameworks consist of two steps: supervised fine-tuning on demonstration data and preference learning on human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios human feedback is intrinsically point-wise, and these methods suffer from information loss or even fail. To fill this gap, we first develop a preference learning method called point-wise DPO to handle point-wise preference data. We then reveal a connection between supervised fine-tuning and point-wise preference learning, which enables us to develop a unified framework for both human demonstration data and point-wise preference data and sheds new light on the construction of preference datasets. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. We also construct and publicly release a new dataset of high-quality demonstration samples on harmlessness.