Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
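The two interventions named above, SAE-latent activation steering and zero-ablation of selected latents, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the toy SAE weights, latent indices, and steering strength below are all placeholder assumptions, and in practice the SAE is trained on a model's residual-stream activations at a chosen layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; real models use far larger dimensions

# Illustrative random SAE weights (real SAEs are trained, not sampled)
W_enc = rng.standard_normal((d_sae, d_model)) * 0.1
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1

def sae_encode(x):
    # ReLU encoder: latent activations for a residual-stream vector x
    return np.maximum(W_enc @ x, 0.0)

def steer(x, latent_idx, strength):
    # Activation steering: add a scaled SAE decoder direction to the
    # residual stream, pushing the model toward that latent's feature
    return x + strength * W_dec[latent_idx]

def zero_ablate(z, latent_ids):
    # Zero-ablation: silence chosen latents (e.g. candidate
    # consistency-checking latents) before decoding
    z = z.copy()
    z[list(latent_ids)] = 0.0
    return z

x = rng.standard_normal(d_model)          # a residual-stream activation
x_steered = steer(x, latent_idx=3, strength=5.0)
z = sae_encode(x_steered)
z_ablated = zero_ablate(z, latent_ids=[3, 7])
```

In the paper's setting, `steer` would be applied via a forward hook at a specific layer during generation, and `zero_ablate` would target the 26 identified latents to test their causal role in ESR.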