A single language model (LM), despite aligning well with an average labeler through reinforcement learning from human feedback (RLHF), may not suit diverse human preferences. Recent approaches therefore opt for customization by collecting multi-dimensional feedback and creating a distinct reward model (RM) for each dimension (e.g., helpfulness, harmlessness, or honesty). Different LMs can then be optimized for different preferences using multi-objective RLHF (MORLHF) with different reward weightings. Yet RL fine-tuning is unstable and resource-heavy, especially for MORLHF, whose objectives are diverse and usually conflicting. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free algorithm that extends Direct Preference Optimization (DPO) to multiple alignment objectives with minimal overhead. Essentially, MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings. While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient. Empirical results from safety alignment and long-form question answering confirm that MODPO matches or outperforms existing methods, consistently producing a Pareto front of LMs that cater to diverse preferences with three times fewer computational resources than MORLHF.
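To make the idea of combining objectives "with specific weightings" concrete, the following is a minimal sketch, not the authors' implementation. It shows (i) a linear scalarization of per-objective rewards, and (ii) a DPO-style pairwise loss in which a reward-difference margin from the other objectives is subtracted from the implicit reward difference. The exact MODPO objective is not stated in this abstract, so the margin form, the function names, and the parameters `beta`, `margin`, and `scale` are all illustrative assumptions.

```python
import math


def scalarize(rewards, weights):
    """Linear scalarization: weighted sum of per-objective rewards."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_style_loss_with_margin(policy_logp_w, policy_logp_l,
                               ref_logp_w, ref_logp_l,
                               margin=0.0, beta=0.1, scale=1.0):
    """Pairwise loss on a (preferred, dispreferred) response pair.

    ASSUMPTION: `margin` stands for the weighted reward difference
    contributed by the other objectives, and `scale` for a weight-dependent
    rescaling. Both are hypothetical stand-ins, not taken from the abstract.
    """
    # Implicit rewards, as in DPO: beta * log(pi_theta / pi_ref).
    implicit_w = beta * (policy_logp_w - ref_logp_w)
    implicit_l = beta * (policy_logp_l - ref_logp_l)
    logit = scale * (implicit_w - implicit_l - margin)
    return -log_sigmoid(logit)


# Hypothetical numbers: two objectives weighted equally.
combined = scalarize([1.0, 2.0], [0.5, 0.5])
loss_no_margin = dpo_style_loss_with_margin(-1.0, -1.0, -1.0, -1.0)
loss_with_margin = dpo_style_loss_with_margin(-1.0, -1.0, -1.0, -1.0, margin=1.0)
```

With `margin=0.0` and `scale=1.0`, the function reduces to the standard DPO pairwise loss. Under the assumptions above, varying the weights would yield a family of models, one per weighting.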