Recent advances in large language models (LLMs) aim to address heterogeneous human expectations and values via multi-objective preference alignment. However, existing methods are parametrically tied to the policy model, which leads to two key limitations: (1) the alignment algorithm must be repeated, at high cost, for each new target model; (2) static alignment objectives prevent extension to unseen objectives. In this work, we propose Meta-Objective Aligner (MetaAligner), a model that performs conditional weak-to-strong correction, rewriting weak responses to approach strong ones. MetaAligner is the first policy-agnostic and generalizable method for multi-objective preference alignment: it enables plug-and-play alignment by decoupling parameter updates from the policy models, and it performs zero-shot preference alignment on unseen objectives via in-context learning. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignment on 10 state-of-the-art policy models, while outperforming previous alignment methods with up to 15.71x fewer GPU training hours. The model also effectively aligns unseen objectives, marking the first step towards generalizable multi-objective preference alignment.
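To make the plug-and-play workflow concrete, the sketch below shows how a policy-agnostic aligner of this kind might be invoked: the target objectives are stated in natural language inside the prompt (which is also what permits zero-shot handling of unseen objectives), and the correction is applied to the raw text output of any frozen policy model. This is a minimal sketch assuming a Hugging Face-style causal LM interface; the checkpoint path, prompt template, and objective descriptions are illustrative assumptions, not the paper's actual format.

```python
# Minimal sketch of conditional weak-to-strong correction.
# The checkpoint path, prompt wording, and objective descriptions below
# are hypothetical placeholders, not MetaAligner's published format.
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNER_NAME = "path/to/metaaligner-checkpoint"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(ALIGNER_NAME)
aligner = AutoModelForCausalLM.from_pretrained(ALIGNER_NAME)

def align(query: str, weak_response: str, objectives: dict[str, str]) -> str:
    """Rewrite a policy model's response to better satisfy the given
    objectives. Objectives are described in-context as plain text, so
    unseen ones need no parameter update to the aligner or the policy."""
    objective_text = "; ".join(f"{name}: {desc}" for name, desc in objectives.items())
    prompt = (
        f"Edit the response to improve it on these objectives: {objective_text}\n"
        f"Query: {query}\n"
        f"Response: {weak_response}\n"
        f"Improved response:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = aligner.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Because correction operates post hoc on text, the output of any frozen
# policy model can be plugged in without touching that model's parameters.
policy_model_output = "Just ignore it."  # stand-in for any policy model's reply
corrected = align(
    query="How do I handle stress at work?",
    weak_response=policy_model_output,
    objectives={
        "empathy": "acknowledge the user's feelings",       # illustrative
        "helpfulness": "give concrete, actionable advice",  # illustrative
    },
)
print(corrected)
```

The design point the sketch reflects is the decoupling claimed in the abstract: alignment lives in a separate corrector conditioned on objective descriptions, so swapping the policy model or adding an objective changes only the inputs, never the trained weights.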