While mechanistic interpretability has identified interpretable circuits in LLMs, the causal origins of those circuits in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs influence functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia model family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions have no effect. Our analysis reveals that repetitive, structured data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction-head formation induce a concurrent change in the model's in-context learning (ICL) capability, providing direct causal evidence for the long-standing hypothesis of a functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, offering a principled methodology for steering the developmental trajectories of LLMs.
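To make the attribution step concrete, below is a minimal sketch of the kind of influence-function scoring MDA builds on. The abstract does not specify the paper's implementation, so everything here is an illustrative assumption: the function names are hypothetical, the inverse-Hessian-vector product is approximated by a crude damped-identity step (a real pipeline would use LiSSA or EK-FAC), and the demo uses a toy linear model rather than a Pythia checkpoint.

```python
import torch
import torch.nn as nn


def grad_vector(loss: torch.Tensor, params: list) -> torch.Tensor:
    """Flatten the gradient of a scalar loss w.r.t. params into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_score(model: nn.Module,
                    target_loss: torch.Tensor,
                    train_loss: torch.Tensor,
                    damping: float = 0.01) -> float:
    """Classic influence-function score (Koh & Liang, 2017):

        I(z) = -grad_theta(L_target)^T  H^{-1}  grad_theta(L(z))

    H^{-1} v is approximated here as v / damping (i.e. H ~ damping * I),
    a deliberately cheap surrogate; practical pipelines replace this
    with LiSSA or EK-FAC inverse-Hessian-vector products.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_target = grad_vector(target_loss, params)  # gradient of the circuit-level metric
    g_train = grad_vector(train_loss, params)    # gradient of the candidate sample's loss
    return -torch.dot(g_target / damping, g_train).item()


# Toy demonstration. In an MDA-style setting, `target_loss` would be a
# head-level metric (e.g., an induction score) and `train_loss` the loss
# on one candidate training sample.
model = nn.Linear(4, 1)
mse = nn.MSELoss()
x_train, y_train = torch.randn(8, 4), torch.randn(8, 1)
x_probe, y_probe = torch.randn(2, 4), torch.randn(2, 1)
score = influence_score(model,
                        target_loss=mse(model(x_probe), y_probe),
                        train_loss=mse(model(x_train), y_train))
print(f"influence score: {score:.4f}")
```

In a pipeline of the kind the abstract describes, one would presumably compute such a score for every candidate training sample against a circuit-level target metric, rank the samples, and then remove or upsample the top-ranked ones to test for causal effects on head emergence.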