In this paper, we introduce \emph{refined Direct Preference Optimization} (rDPO), a method for improving the behavioral alignment of Large Language Models (LLMs) without the need for human-annotated data. The method creates synthetic data by applying self-critique prompting to a teacher LLM, then uses a generalized DPO loss function to distill into a student LLM. The loss function incorporates an additional external reward model to improve the quality of the synthetic data, making rDPO robust to potential noise in the synthetic dataset. We show that rDPO is effective across a diverse set of behavioral alignment tasks, including improving safety, increasing robustness against role-playing, and reducing sycophancy. Code will be released at https://github.com/vicgalle/refined-dpo.