Universality is a key hypothesis in mechanistic interpretability: different models learn similar features and circuits when trained on similar tasks. In this work, we study the universality hypothesis by examining how small neural networks learn to implement group composition. We present a novel algorithm, based on representation theory, by which neural networks may implement composition for any finite group. By reverse engineering model logits and weights, we then show that networks consistently learn this algorithm, and we confirm our understanding using ablations. By studying networks of differing architectures trained on various groups, we find mixed evidence for universality: using our algorithm, we can completely characterize the family of circuits and features that networks learn on this task, but for a given network the precise circuits learned, as well as the order in which they develop, are arbitrary.
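The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way representation theory can turn group composition into linear algebra, not the paper's implementation. It assumes a faithful orthogonal matrix representation rho and scores each candidate output c by the trace of rho(a) rho(b) rho(c)^{-1}, which is largest exactly when c = ab. The choice of the symmetric group S_3, its permutation representation, and the names `perm_matrix`, `compose`, and `predict` are all illustrative assumptions.

```python
import itertools
import numpy as np

def perm_matrix(p):
    """Permutation matrix rho(p) that maps basis vector e_i to e_{p[i]}."""
    n = len(p)
    m = np.zeros((n, n))
    m[np.array(p), np.arange(n)] = 1.0
    return m

def compose(p, q):
    """Group product p*q in S_n: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

# Illustrative group: S_3, with its 3-dimensional permutation representation.
elements = list(itertools.permutations(range(3)))
rho = {g: perm_matrix(g) for g in elements}

def predict(a, b):
    """Pick the c that maximizes tr(rho(a) rho(b) rho(c)^{-1})."""
    hidden = rho[a] @ rho[b]  # equals rho(a*b), since rho is a homomorphism
    # Permutation matrices are orthogonal, so the inverse is the transpose.
    scores = [np.trace(hidden @ rho[c].T) for c in elements]
    return elements[int(np.argmax(scores))]

# Check the rule against the true multiplication table of S_3.
all_correct = all(
    predict(a, b) == compose(a, b) for a in elements for b in elements
)
```

The trace here counts the fixed points of the permutation a b c^{-1}, which reaches its maximum of 3 only for the identity, so the argmax is unique. For other groups the same idea would need a different faithful representation in place of the permutation matrices.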