For aligning large language models (LLMs), prior work has leveraged reinforcement learning from human feedback (RLHF) or variants of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it sacrifices the ability to easily tune language models toward auxiliary, non-preferential objectives specified by the LLM designer (e.g., tuning lexical style or minimizing specific kinds of harmful content). Critically, these designer objectives may not be amply human-labeled or represented in available data, may not align with user preferences, and may not even be tractably captured by binary preference pairs. To combine the simplicity and performance of DPO with the generality of RL, we propose a unified approach. Based on a simple decomposition of preference and auxiliary objectives, our method tunes LLMs to optimize both user and designer preferences without additional specialized or preference data, extra computational cost, stability ``tweaks'', or training instability. The proposed method, Unified Preference Optimization, effectively generalizes to user preferences and auxiliary objectives, while preserving or surpassing alignment performance on challenging benchmarks across a range of model sizes.
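As a rough illustration of what such a decomposition could look like, consider the following sketch; the notation ($\pi_{\mathrm{ref}}$ for the reference policy, $\beta$ for the KL temperature, $r_{\mathrm{aux}}$ for a designer-specified auxiliary reward, $\mathcal{D}$ for the preference dataset) is assumed for exposition and is not necessarily the paper's exact formulation. If the total reward decomposes as $r = r_{\mathrm{pref}} + r_{\mathrm{aux}}$ and preferences follow a Bradley--Terry model on $r_{\mathrm{pref}}$ alone, then applying DPO's change of variables to $r_{\mathrm{pref}}$ yields a DPO-style loss whose implicit reward margin is shifted by the auxiliary reward difference:
\begin{align*}
  r(x, y) &= r_{\mathrm{pref}}(x, y) + r_{\mathrm{aux}}(x, y),\\
  \mathcal{L}(\pi_\theta) &= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    - \bigl(r_{\mathrm{aux}}(x, y_w) - r_{\mathrm{aux}}(x, y_l)\bigr)
  \right)\right].
\end{align*}
Under these assumptions, the partition-function terms cancel in the pairwise difference, so the auxiliary objective enters only as an offset on the usual DPO log-ratio margin and requires no extra preference data or separate reward-model training.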