Large language models' behavior is often shaped by instructions such as system prompts, refusal boundaries, privacy constraints, and tool-use rules that must hold at inference time. Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks. This motivates inference-time interventions that strengthen instruction influence without retraining. One such intervention is attention steering, which biases attention toward instruction tokens. In this work, we present a unifying theory for attention steering methods by formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate. We prove that boosting attention to instruction tokens tilts this competition, making it harder for context to override instruction-following. However, excessive boosting can suppress task-relevant context that should be incorporated alongside the instruction. Guided by this theory, we propose Instruction Attention Boosting (InstABoost), a simple intervention that applies a constant additive bias to instruction-key attention logits across all layers and heads. We evaluate InstABoost against prompting, latent steering, and prior attention steering methods across 15 tasks. InstABoost matches or outperforms all baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.
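The core mechanism described above, a constant additive bias on instruction-key attention logits, can be illustrated with a minimal single-head NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `boosted_attention`, the parameter names `instruction_mask` and `bias`, the causal masking, and the choice to add the bias after the usual 1/sqrt(d) scaling are all hypothetical details not given in the abstract. In a real model the same bias would be applied in every layer and head.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def boosted_attention(q, k, v, instruction_mask, bias=0.0):
    """Single-head causal attention with an additive bias on instruction keys.

    q, k, v:          arrays of shape (seq_len, d)
    instruction_mask: boolean array of shape (seq_len,), True at instruction tokens
    bias:             constant added to the logits of instruction-key columns
                      (hypothetical parameter name; 0.0 recovers plain attention)
    """
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d)

    # Add the same constant to every logit whose key is an instruction token.
    logits = logits + bias * instruction_mask[None, :].astype(logits.dtype)

    # Causal mask (an assumption): each query sees only itself and earlier keys.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    logits = np.where(causal, logits, -np.inf)

    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Toy example: the first 3 of 8 tokens are treated as the instruction.
rng = np.random.default_rng(0)
seq_len, d = 8, 16
q, k, v = rng.standard_normal((3, seq_len, d))
mask = np.zeros(seq_len, dtype=bool)
mask[:3] = True

_, w0 = boosted_attention(q, k, v, mask, bias=0.0)
_, w1 = boosted_attention(q, k, v, mask, bias=2.0)

# Attention mass the final query places on instruction tokens, before and after.
mass_plain = w0[-1, mask].sum()
mass_boosted = w1[-1, mask].sum()
print(f"instruction mass: plain={mass_plain:.3f}, boosted={mass_boosted:.3f}")
```

Because the softmax renormalizes each row, raising the instruction-key logits necessarily lowers the weight on all other keys. This is the tradeoff the abstract points to: a larger `bias` makes the instruction harder to override but can also suppress task-relevant context.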