Aligning large language models with human expectations, such as being helpful and harmless, has become a pressing challenge. A typical alignment procedure consists of supervised fine-tuning and preference learning. Most preference learning methods, such as RLHF and DPO, depend on pairwise preference data and therefore cannot adequately handle scenarios where human feedback is point-wise, leading to potential information loss and suboptimal performance. To address this gap, we introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to use point-wise feedback effectively. Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment, a single-step method that unifies alignment with human demonstrations and alignment with point-wise preferences. Extensive experiments on point-wise preference datasets with binary or continuous labels validate the effectiveness of our methods. We release our code and a new dataset of high-quality demonstration samples on harmlessness.
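The contrast between pairwise and point-wise feedback can be illustrated with a minimal sketch. The pairwise function follows the standard DPO objective. The point-wise function is only an assumption about what a point-wise analogue might look like (a cross-entropy on the implicit reward against a label in [0, 1]) and is not necessarily the exact objective proposed in this work. All function names, the `beta` default, and the toy log-probabilities are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def pairwise_dpo_loss(logp_w: float, ref_w: float,
                      logp_l: float, ref_l: float,
                      beta: float = 0.1) -> float:
    """Standard DPO loss; needs a (preferred, dispreferred) pair for one prompt."""
    margin = implicit_reward(logp_w, ref_w, beta) - implicit_reward(logp_l, ref_l, beta)
    return -log_sigmoid(margin)


def pointwise_loss(logp: float, ref: float, label: float, beta: float = 0.1) -> float:
    """Assumed point-wise analogue: cross-entropy between sigmoid(reward) and
    a label in [0, 1] attached to a single response (no pairing required)."""
    r = implicit_reward(logp, ref, beta)
    return -(label * log_sigmoid(r) + (1.0 - label) * log_sigmoid(-r))


if __name__ == "__main__":
    # Toy sequence log-probabilities (hypothetical values).
    print(pairwise_dpo_loss(logp_w=-4.0, ref_w=-5.0, logp_l=-6.0, ref_l=-5.0))
    print(pointwise_loss(logp=-4.0, ref=-5.0, label=1.0))
    print(pointwise_loss(logp=-4.0, ref=-5.0, label=0.3))
```

In this sketch the pairwise loss cannot be computed for a response that has no counterpart, whereas the point-wise loss consumes each labeled response on its own, which is the situation the abstract describes for binary or continuous labels.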