Toward Preference-aligned Large Language Models via Residual-based Model Steering

Preference alignment is a critical step in making Large Language Models (LLMs) useful and consistent with human preferences. Existing approaches such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) typically require curated data and expensive optimization over billions of parameters, and ultimately produce persistent, task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, compared with models aligned using DPO and SimPO, PaLRS-aligned models perform better while requiring substantially less time. Our findings highlight that PaLRS is an effective, much more efficient, and more flexible alternative to standard preference optimization pipelines: a training-free, plug-and-play mechanism for alignment with minimal data.
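
The abstract does not spell out how the steering vectors are computed, so the following is only a minimal sketch of the general idea, assuming a difference-of-means rule over residual-stream activations, a common choice in activation steering that may differ from what PaLRS actually does. The function names, the scale `alpha`, and the synthetic activations are all hypothetical; a real pipeline would read activations from a chosen layer of an LLM rather than sample them at random.

```python
import numpy as np


def extract_steering_vector(preferred, dispreferred):
    """Assumed extraction rule: difference of mean activations, unit-normalized.

    Both inputs have shape (n_pairs, d_model) and stand in for residual-stream
    activations collected on preferred and dispreferred responses.
    """
    diff = preferred.mean(axis=0) - dispreferred.mean(axis=0)
    return diff / np.linalg.norm(diff)


def apply_steering(hidden, vector, alpha=4.0):
    """Add the scaled steering vector to every token position at inference time.

    No model weights are modified; alpha is a hypothetical strength parameter.
    """
    return hidden + alpha * vector


# Synthetic stand-in for activations from 100 preference pairs.
rng = np.random.default_rng(0)
d_model, n_pairs = 64, 100
true_direction = np.zeros(d_model)
true_direction[0] = 1.0  # toy "preference" direction planted in the data

dispreferred = rng.normal(size=(n_pairs, d_model))
preferred = rng.normal(size=(n_pairs, d_model)) + 2.0 * true_direction

vector = extract_steering_vector(preferred, dispreferred)

# Steer a toy sequence of 5 token activations.
hidden = rng.normal(size=(5, d_model))
steered = apply_steering(hidden, vector, alpha=4.0)
```

Because the vector is simply added to activations, it can be switched on, off, or rescaled per request, which is what makes such a mechanism plug-and-play in contrast to fine-tuned weights.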

