A single language model (LM), even when aligned well with an average labeler through reinforcement learning from human feedback (RLHF), may not suit the full diversity of human preferences. Recent approaches therefore pursue customization, training separate principle-based reward models to represent different alignment objectives (e.g., helpfulness, harmlessness, or honesty). Different LMs can then be trained for different preferences through multi-objective RLHF (MORLHF) with different objective weightings. However, RLHF is unstable and resource-heavy, especially for MORLHF, whose objectives are diverse and usually conflicting. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free algorithm that extends Direct Preference Optimization (DPO) to multiple alignment objectives. In essence, MODPO folds LM learning directly into reward modeling, aligning LMs with the weighted sum of all principle-based rewards using a pure cross-entropy loss. While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is more stable and computationally efficient in practice, since it requires neither value function modeling nor online sample collection. Empirical results in safety alignment and long-form question answering confirm that MODPO matches or outperforms existing methods, consistently producing one of the most competitive LM fronts catering to diverse preferences while using 3 times less computation than MORLHF.
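To make the two ingredients named in the abstract concrete, the following is a minimal sketch of (a) collapsing several principle-based rewards into one scalar with a weighted sum and (b) a DPO-style cross-entropy loss on a single preference pair. The function names, the `beta` temperature, and the `margin` argument are illustrative assumptions: the abstract does not state the exact MODPO objective, so `margin` is only a placeholder for however the remaining objectives' rewards enter the loss.

```python
import math


def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards (linear scalarization).

    rewards: one scalar per objective, e.g. [helpfulness, harmlessness].
    weights: the preference weighting over those objectives (assumed here
             to be supplied by the user; typically non-negative, summing to 1).
    """
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(logratio_chosen, logratio_rejected, beta=0.1, margin=0.0):
    """Cross-entropy loss on one preference pair, in the style of DPO.

    logratio_* is log pi(y|x) - log pi_ref(y|x) for the chosen and rejected
    responses. `margin` is a hypothetical offset, not taken from the abstract;
    with margin=0 this reduces to the standard DPO loss.
    """
    logit = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(logit)) equals softplus(-logit)
    return softplus(-logit)


if __name__ == "__main__":
    # Two objectives weighted 0.75 / 0.25.
    print(scalarize([1.0, -2.0], [0.75, 0.25]))
    # Loss for a pair where the policy already favors the chosen response.
    print(preference_loss(2.0, 0.0, beta=1.0))
```

Sweeping the weight vector over several values and training one model per weighting is what would trace out the front of LMs the abstract refers to; the sketch only covers the per-pair loss, not the training loop.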