Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce SPLIT, a new steering approach guided by this analysis, which improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
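The shared log-odds scale mentioned above can be illustrated with a minimal sketch. All numbers below are toy values, not results from the paper: `p_pos` stands for the model's probability of the target-concept side of a polarity pair, and `p_valid` for the probability of a coherent, task-valid continuation; measuring both as log-odds puts preference and utility on the same scale.

```python
import math

def log_odds(p):
    """Log-odds of probability p: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Toy polarity-paired contrastive example (illustrative numbers only):
# p_pos   = probability of the target-concept completion,
# p_valid = probability of a coherent, task-valid continuation.
p_pos_before, p_pos_after = 0.40, 0.75      # control raises preference...
p_valid_before, p_valid_after = 0.90, 0.70  # ...at some cost to utility

preference_gain = log_odds(p_pos_after) - log_odds(p_pos_before)
utility_loss = log_odds(p_valid_before) - log_odds(p_valid_after)

print(f"preference gain (log-odds): {preference_gain:.3f}")
print(f"utility loss   (log-odds): {utility_loss:.3f}")
```

With these toy values, stronger control yields a positive preference gain and a positive utility loss, mirroring the trade-off the abstract describes.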