Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its mechanism remains unclear. We show that intrinsic self-correction works by steering hidden representations along interpretable latent directions, as evidenced by both alignment analysis and activation interventions. Specifically, we analyze intrinsic self-correction through the representation shifts induced by prompting. In parallel, we construct interpretable latent directions from contrastive pairs and verify their causal effect via activation addition. Evaluating six open-source LLMs, we find that prompt-induced representation shifts in text detoxification and text toxification consistently align with the latent directions constructed from contrastive pairs: in detoxification, the shifts align with the non-toxic direction; in toxification, with the toxic direction. These findings suggest that representation steering is the mechanistic driver of intrinsic self-correction. Our analysis highlights that understanding model internals offers a direct route to analyzing the mechanisms of prompt-driven LLM behaviors.
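The two operations named above, building a latent direction from contrastive pairs and intervening via activation addition, can be sketched as follows. This is a minimal illustration with mocked hidden states and hypothetical names (`h_toxic`, `h_nontoxic`, `steer`, `alpha`), not the paper's implementation; in practice the hidden states would come from a specific layer of an LLM.

```python
# Sketch: difference-of-means latent direction + activation addition.
# All data below is synthetic; shapes and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden-state dimensionality (illustrative)

# Mean hidden states over toxic vs. non-toxic contrastive prompts (mocked).
h_toxic = rng.normal(size=(100, d)).mean(axis=0) + 0.5
h_nontoxic = rng.normal(size=(100, d)).mean(axis=0) - 0.5

# Interpretable "non-toxic" direction: difference of means, unit-normalized.
v = h_nontoxic - h_toxic
v /= np.linalg.norm(v)

def steer(hidden, direction, alpha=4.0):
    """Activation addition: shift a hidden state along a latent direction."""
    return hidden + alpha * direction

h = rng.normal(size=d)          # a hidden state to intervene on
h_steered = steer(h, v)

# Projection of the induced shift onto v equals alpha, since v is unit-norm.
shift = (h_steered - h) @ v
```

Detoxification prompts would be expected to move hidden states in the same direction as `v`; toxification prompts, in the opposite direction, which is what the alignment analysis measures.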