Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average, it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.
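The core mechanism can be illustrated with a minimal numerical sketch. Below, a toy nonlinear scorer `concept_score` stands in for the learned differentiable concept scoring function; its gradient at each activation defines a local steering direction, so the update varies with the activation rather than being a single global vector. All names, sizes, and the specific scorer form (`w · tanh(W h)`) are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # toy hidden size and scorer width (illustrative, not from the paper)

# Hypothetical learned scorer f(h) = w . tanh(W h). It is differentiable and
# nonlinear, so its gradient (the local steering direction) changes with h --
# unlike a static steering vector, which is the same everywhere.
W = rng.normal(size=(k, d))
w = rng.normal(size=k)

def concept_score(h):
    """Scalar concept score of an activation h."""
    return float(w @ np.tanh(W @ h))

def steering_direction(h):
    """Unit-norm gradient of the scorer: df/dh = W^T (w * (1 - tanh(Wh)^2))."""
    t = np.tanh(W @ h)
    g = W.T @ (w * (1.0 - t**2))
    return g / (np.linalg.norm(g) + 1e-8)

def steer(h, alpha=0.1):
    """Context-dependent intervention: move h along the local concept gradient."""
    return h + alpha * steering_direction(h)

h1, h2 = rng.normal(size=d), rng.normal(size=d)
d1, d2 = steering_direction(h1), steering_direction(h2)
# The direction depends on the activation: d1 and d2 generally differ.
print(float(d1 @ d2))
# A small gradient step raises the concept score at h1.
print(concept_score(steer(h1)) > concept_score(h1))
```

A static SV corresponds to the special case where `steering_direction` returns the same vector for every `h` (e.g., a linear scorer `f(h) = v . h`); the vector-field view recovers context dependence by letting the direction follow the scorer's local gradient.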