Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extensibility. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
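The mechanism described above can be illustrated with a minimal sketch in the style of task arithmetic: each preference vector is the parameter-wise difference between a single-preference fine-tuned model and the shared base model, and a user-weighted sum of these vectors is added back to the base model at test time. The function names and weighting scheme below are illustrative assumptions, not the paper's actual implementation.

```python
def extract_preference_vector(base_state, tuned_state):
    # Preference vector = parameter-wise shift induced by fine-tuning
    # on a single preference (assumes both are torch state_dicts with
    # identical keys and shapes).
    return {k: tuned_state[k] - base_state[k] for k in base_state}


def merge_preference_vectors(base_state, vectors, weights):
    # Add a user-weighted combination of preference vectors to the base
    # model; adjusting `weights` changes the trade-off without retraining.
    merged = {k: v.clone() for k, v in base_state.items()}
    for vec, w in zip(vectors, weights):
        for k in merged:
            merged[k] = merged[k] + w * vec[k]
    return merged


# Usage sketch (hypothetical names): combine "helpful" and "harmless" shifts.
# v_help = extract_preference_vector(base_sd, helpful_sd)
# v_harm = extract_preference_vector(base_sd, harmless_sd)
# merged_sd = merge_preference_vectors(base_sd, [v_help, v_harm], [1.0, 0.6])
# model.load_state_dict(merged_sd)
```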