Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structures such as the induction circuit. In a synthetic parenthesis-balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.
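As a pointer to the formalism, the following is a minimal sketch; the notation (\(\chi\), \(q\), \(\ell\), \(\beta\)) is ours, chosen to match the standard tempered-posterior setup, and may differ from the paper's. Reweighting the data distribution \(q\) along a direction \(\phi\), \(q_\varepsilon = q + \varepsilon\,\phi\), shifts the population loss by \(\delta L(w) = \mathbb{E}_{\phi}[\ell(w, x)]\), and for a posterior \(p(w \mid D_n) \propto \exp(-n\beta L_n(w))\,\varphi(w)\) the susceptibility of an observable \(O\) is the linear response

\[
\chi(O;\,\phi) \;=\; \left.\frac{d}{d\varepsilon}\,\mathbb{E}_{p_\varepsilon}[O(w)]\right|_{\varepsilon=0} \;=\; -\,n\beta\,\operatorname{Cov}_{p(w \mid D_n)}\!\bigl(O(w),\,\delta L(w)\bigr).
\]

Patterning inverts this relation: given observables \(O_1,\dots,O_k\), candidate reweighting directions \(\phi_1,\dots,\phi_m\), and a target shift \(\Delta O\), solve the linear system

\[
\sum_{j=1}^{m} \chi(O_i;\,\phi_j)\,c_j \;=\; \Delta O_i,
\qquad\text{i.e.}\qquad
c \;=\; \chi^{+}\,\Delta O,
\]

with \(\chi^{+}\) the pseudoinverse, then train on the reweighted data \(q + \sum_j c_j\,\phi_j\).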
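To make the estimation and inversion steps concrete, here is a minimal numerical sketch under our own assumptions: synthetic arrays stand in for posterior samples that would in practice come from an MCMC method such as SGLD, and all names (O, dL, chi) are illustrative, not the paper's code.

```python
# Sketch of patterning via susceptibilities: estimate the susceptibility
# matrix from posterior samples with the fluctuation formula
# chi[i, j] = -n * Cov(O_i, dL_j), then solve the linear response relation
# for data-reweighting coefficients that target a desired observable shift.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for S posterior samples w_1..w_S (e.g. SGLD draws), evaluated as:
#   O[s, i]  = value of observable i at posterior sample s
#   dL[s, j] = loss shift at sample s induced by reweighting direction phi_j
S, k, m, n = 2000, 3, 5, 10_000          # samples, observables, directions, dataset size
O = rng.normal(size=(S, k))
dL = 0.3 * O @ rng.normal(size=(k, m)) + rng.normal(size=(S, m))

# Fluctuation estimate of the susceptibility matrix (beta = 1 here):
# chi[i, j] = d<O_i>/d(eps_j) = -n * Cov(O_i, dL_j) under the posterior.
O_c = O - O.mean(axis=0)
dL_c = dL - dL.mean(axis=0)
chi = -n * (O_c.T @ dL_c) / (S - 1)       # shape (k, m)

# Patterning step: pick a target shift in the observables and solve
# chi @ c = delta_O for the mixture coefficients c by least squares
# (chi is generally rectangular, so this is the pseudoinverse solution).
delta_O = np.array([1.0, 0.0, -0.5])      # desired change in <O_1>, <O_2>, <O_3>
c, *_ = np.linalg.lstsq(chi, delta_O, rcond=None)

print("reweighting coefficients:", c)
print("predicted shift:", chi @ c)        # should approximate delta_O
```

In practice the covariance estimate is noisy, so one would restrict to the principal susceptibility directions (the leading singular vectors of chi) before inverting, which is the reweighting scheme the abstract refers to.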