While induction is considered a key mechanism for in-context learning in LLMs, understanding its precise circuit decomposition beyond toy models remains elusive. Here, we study the emergence of induction behavior within LLMs by probing their response to weak single-token perturbations of the residual stream. We find that LLMs exhibit a robust, universal regime in which their response remains scale-invariant under changes in perturbation strength, thereby allowing us to quantify the build-up of token correlations throughout the model. By applying our method, we observe signatures of induction behavior within the residual stream of Gemma-2-2B, Llama-3.2-3B, and GPT-2-XL. Across all models, we find that these induction signatures gradually emerge within intermediate layers and identify the relevant model sections composing this behavior. Our results provide insights into the collective interplay of components within LLMs and serve as a benchmark for large-scale circuit analysis.