Language models (LMs) aligned through reinforcement learning from human feedback (RLHF) fit an average labeler well, but they may not suit the full range of human preferences. Recent approaches therefore pursue customization: they collect multi-dimensional feedback and build a distinct reward for each dimension (e.g., helpfulness, harmlessness, honesty). LMs can then be tailored to different preferences using multi-objective RL (MORL) with different reward weightings. Yet RL fine-tuning is unstable and resource-heavy, especially for multi-objective RLHF (MORLHF), where the objectives are diverse and usually conflicting. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free algorithm that extends Direct Preference Optimization (DPO) to multiple alignment objectives. Essentially, MODPO trains different LMs to represent different collective reward models, each of which combines all objectives with a specific weighting. Although they are trained with a simple cross-entropy loss, the LMs optimized against the MODPO objective are analytically the exact solutions of the original MORLHF objective. Empirical results in safety alignment and long-form question answering confirm that MODPO matches or outperforms existing methods, efficiently producing a Pareto-optimal set of LMs that cater to diverse preferences while using about one third of the computational resources required by MORLHF.
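To make the idea of a collective reward concrete, the following minimal sketch combines per-dimension rewards with a weighting and shows how two weightings can prefer different responses. It assumes the combination is a plain weighted sum (linear scalarization), which is one common reading of "combine all objectives with specific weightings". The function names, candidate responses, and scores are hypothetical illustrations, not part of MODPO itself.

```python
from typing import Dict


def collective_reward(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension rewards (assumed linear scalarization)."""
    if set(rewards) != set(weights):
        raise ValueError("rewards and weights must cover the same dimensions")
    return sum(weights[name] * rewards[name] for name in rewards)


def preferred_response(
    candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> str:
    """Return the candidate with the highest collective reward."""
    return max(candidates, key=lambda name: collective_reward(candidates[name], weights))


# Hypothetical per-dimension scores for two candidate responses.
candidates = {
    "detailed_answer": {"helpfulness": 0.9, "harmlessness": 0.2},
    "cautious_answer": {"helpfulness": 0.4, "harmlessness": 0.9},
}

# Two hypothetical weightings, each standing in for a different user preference.
helpful_first = {"helpfulness": 0.8, "harmlessness": 0.2}
safety_first = {"helpfulness": 0.2, "harmlessness": 0.8}

choice_helpful = preferred_response(candidates, helpful_first)
choice_safe = preferred_response(candidates, safety_first)
print(choice_helpful, choice_safe)
```

Under these made-up scores, the helpfulness-heavy weighting selects the detailed answer and the safety-heavy weighting selects the cautious one, which illustrates why a single LM tuned to one weighting cannot serve every preference.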