Alignment in artificial intelligence seeks consistency between model responses and human preferences and values. In practice, the multifaceted nature of human preferences introduces what is known as the "alignment tax": a compromise in which improving alignment on one objective (e.g., harmlessness) can degrade performance on others (e.g., helpfulness). Existing alignment techniques, however, are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility across objectives. To address this challenge, we argue for the importance of grounding LLMs in explicit preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis shows that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, thereby mitigating the impact of the alignment tax and achieving Pareto improvements in multi-objective alignment.
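To make the idea of explicitly specified preference scores concrete, the following is a minimal sketch of how an instruction might be conditioned on per-objective scores before being passed to a model. It is an illustration under stated assumptions, not the CPO implementation: the tag format, the 1 to 5 score range, and the helper name `build_controlled_prompt` are hypothetical and are not given in the abstract.

```python
from typing import Mapping

# Hypothetical choices for illustration only: the abstract names the "3H"
# objectives but does not specify a tag format or a score scale.
OBJECTIVES = ("helpfulness", "honesty", "harmlessness")
MIN_SCORE, MAX_SCORE = 1, 5


def build_controlled_prompt(instruction: str, scores: Mapping[str, int]) -> str:
    """Prefix an instruction with one control tag per specified objective.

    Objectives missing from `scores` are left unconstrained (no tag emitted).
    """
    unknown = set(scores) - set(OBJECTIVES)
    if unknown:
        raise ValueError(f"unknown objectives: {sorted(unknown)}")

    tags = []
    for name in OBJECTIVES:  # fixed order keeps prompts deterministic
        if name not in scores:
            continue
        value = scores[name]
        if not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{name} score {value} outside [{MIN_SCORE}, {MAX_SCORE}]")
        tags.append(f"<{name}:{value}>")

    prefix = "".join(tags)
    return f"{prefix} {instruction}" if prefix else instruction


if __name__ == "__main__":
    # Request a maximally harmless, moderately helpful response.
    print(build_controlled_prompt(
        "Explain how lock picking works.",
        {"helpfulness": 3, "harmlessness": 5},
    ))
```

In this sketch the control tags only express the requested trade-off; a model would still have to be trained so that its responses actually follow the scores, which is the part the proposed method addresses.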