Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve the convergence rate in the two-objective setting. Second, we augment our method with practical heuristics and conduct experiments demonstrating the applicability of the proposed framework to LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.
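To make the conflict-resolution idea concrete, the following is a minimal NumPy sketch of a two-objective conflict-averse update in the spirit of CAGrad (Liu et al., 2021), with a simple norm clip added on top. The abstract does not specify RACO's exact clipping rule, so the `clip_norm` threshold, the grid-search dual solver, and all function names here are illustrative assumptions, not the paper's method.

```python
import numpy as np

def clipped_cagrad_direction(g1, g2, c=0.5, clip_norm=1.0, n_grid=201):
    """Sketch of a conflict-averse update for two objective gradients.

    Solves the CAGrad dual (reduced to a 1-D search over the simplex
    weight w for two objectives), then applies a hypothetical norm clip;
    the actual clipping rule in RACO may differ.
    """
    g0 = 0.5 * (g1 + g2)                 # average gradient
    phi = (c ** 2) * float(g0 @ g0)      # trust-region radius squared

    # Dual objective: F(w) = <g_w, g0> + sqrt(phi) * ||g_w||,
    # minimized over w in [0, 1] by grid search for simplicity.
    best_w, best_val = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, n_grid):
        gw = w * g1 + (1.0 - w) * g2
        val = float(gw @ g0) + np.sqrt(phi) * np.linalg.norm(gw)
        if val < best_val:
            best_w, best_val = w, val

    gw = best_w * g1 + (1.0 - best_w) * g2
    d = g0 + np.sqrt(phi) / (np.linalg.norm(gw) + 1e-12) * gw

    # Hypothetical clipping step: cap the update norm (assumption;
    # stands in for the "clipped" variant the abstract refers to).
    norm = np.linalg.norm(d)
    if norm > clip_norm:
        d = d * (clip_norm / norm)
    return d

# Toy usage: two conflicting gradients (negative inner product).
g1 = np.array([1.0, 0.2])
g2 = np.array([-0.8, 1.0])
print(clipped_cagrad_direction(g1, g2))
```

The key property this construction targets is that the returned direction `d` makes non-negative progress on both objectives whenever one exists within the trust region around the average gradient, rather than letting the larger gradient dominate as a naive weighted sum would.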