The Clock and Pizza interpretations, associated with architectures that differ in using uniform versus learnable attention, were introduced to argue that different architectural designs yield distinct circuits for modular addition. In this work, we show that this is not the case: architectures with uniform attention and with learnable attention implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond interpreting individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and study that group of neurons collectively as a single entity. This method reveals that each learned representation is a manifold that we can study using tools from topology. Building on this insight, we statistically analyze the learned representations across hundreds of circuits and demonstrate the similarity of modular addition circuits that arise naturally from common deep learning paradigms.
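To make the topological claim concrete, the following is a minimal sketch, not the paper's code, of how persistent homology can test whether a group of neurons jointly traces out a circle, the manifold expected of a learned modular addition representation. It assumes the third-party ripser package; the modulus p, the Fourier frequency k, the random embedding, and the synthetic activations are all hypothetical stand-ins for activations extracted from the identified neuron group of a trained network.

```python
# Sketch: detect circular topology in a group of neurons via persistent homology.
# Assumes the `ripser` package (pip install ripser). All quantities below are
# hypothetical stand-ins for activations taken from a trained network.
import numpy as np
from ripser import ripser

p = 59   # modulus (hypothetical choice)
k = 7    # Fourier frequency of the representation (hypothetical choice)
rng = np.random.default_rng(0)

# Hypothetical activations: each input a in Z_p maps to a point on a circle
# embedded in a 16-neuron activation space, plus small noise. In practice,
# `acts` would be the joint activations of the identified neuron group.
angles = 2 * np.pi * k * np.arange(p) / p
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (p, 2)
embed = rng.normal(size=(2, 16))                              # random linear embedding
acts = circle @ embed + 0.05 * rng.normal(size=(p, 16))       # (p, 16)

# Persistent homology up to dimension 1. A single long-lived H1 feature
# (one persistent loop) is the topological signature of a circle.
dgms = ripser(acts, maxdim=1)["dgms"]
h1 = dgms[1]
lifetimes = h1[:, 1] - h1[:, 0]
print(f"H1 features: {len(h1)}, longest lifetime: {lifetimes.max():.3f}")
```

Applied across many trained circuits, a statistic such as the gap between the longest and second-longest H1 lifetime gives a quantitative, per-circuit test that the learned representation is a circle, which is the kind of population-level comparison the abstract describes.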