Alignment in artificial intelligence seeks consistency between model responses and human preferences and values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax": a compromise whereby improving alignment on one objective (e.g., harmlessness) can diminish performance on others (e.g., helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility across objectives. To navigate this challenge, we argue for grounding LLMs in explicit preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet those requirements. Our experimental analysis reveals that the aligned models can provide responses matching various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, thereby mitigating the impact of the alignment tax and achieving Pareto improvements in multi-objective alignment.
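To make the conditioning idea concrete, here is a minimal sketch of how explicit per-objective preference scores could be injected into a prompt as a control prefix. This is an illustration only, not the paper's released interface: the function `build_controlled_prompt`, the token format, and the 1–5 score scale are all assumptions.

```python
# A minimal sketch (not the paper's actual code) of conditioning a model on
# explicit preference scores, assuming scores are injected as a control prefix.

def build_controlled_prompt(user_query: str,
                            helpfulness: int,
                            honesty: int,
                            harmlessness: int) -> str:
    """Prepend hypothetical "3H" preference scores (assumed 1-5 scale) so an
    aligned model can condition its response on the desired trade-off."""
    control = (f"<helpfulness:{helpfulness}>"
               f"<honesty:{honesty}>"
               f"<harmlessness:{harmlessness}>")
    return f"{control}\n{user_query}"

# Example: request a maximally harmless and honest answer, accepting some
# loss of helpfulness on a sensitive query.
prompt = build_controlled_prompt("How do I pick a lock?",
                                 helpfulness=3, honesty=5, harmlessness=5)
print(prompt)
```

Under this reading, an aligned model trained on score-annotated data would learn to trade off the three objectives according to the prefix, rather than committing to a single fixed trade-off at training time.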