Aligning large language models (LLMs) with human values requires balancing multiple, potentially conflicting objectives, such as helpfulness, truthfulness, and harmlessness, making alignment a multi-objective optimisation problem. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or on explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods, and MGDA-Decoupled in particular, achieve the highest win rates against golden responses, both overall and per objective.
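For intuition, the sketch below shows the shared-descent-direction computation that MGDA-family methods build on: for two objectives, the minimum-norm point in the convex hull of the per-objective gradients has a closed form, and its negation is a direction along which both losses decrease to first order. This is the classic MGDA step only, not the MGDA-Decoupled variant; the convergence-aware decoupling mentioned above is the paper's contribution and is not reproduced here. The function name and toy gradients are illustrative.

```python
import numpy as np

def min_norm_direction(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Minimum-norm point of the convex hull of {g1, g2} (classic MGDA step).

    The returned d satisfies d @ g1 >= ||d||^2 and d @ g2 >= ||d||^2, so
    when d != 0 a small step along -d decreases both objectives to first order.
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:  # identical gradients: any convex combination works
        return g1
    # Closed form of argmin_{gamma in [0, 1]} ||gamma*g1 + (1-gamma)*g2||^2
    gamma = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2

# Toy example with two conflicting objective gradients.
g_helpfulness = np.array([1.0, 0.0])
g_harmlessness = np.array([0.0, 1.0])
d = min_norm_direction(g_helpfulness, g_harmlessness)
print(d)  # [0.5 0.5]; updating parameters along -d improves both objectives
```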