Mechanistic interpretability aims to decompose models into meaningful parts; a prerequisite is verifying that two such parts implement the same computation. Existing similarity measures either evaluate empirical behaviour, which leaves them blind to out-of-distribution mechanisms, or compare basis-dependent parameters, which ignores weight-space symmetries. To address both issues for the class of tensor-based models, we introduce tensor similarity, a weight-based metric that is invariant to such symmetries. The metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness to a solved algebraic problem rather than one of empirical approximation.
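To make the weight-space symmetry issue concrete, the following toy sketch may help. It is not the tensor similarity metric or the recursive algorithm described above; it only assumes a hypothetical two-layer linear model and shows that permuting and rescaling the hidden units changes the raw parameters while leaving the computed function unchanged. All names (`W1`, `W2`, `cosine`, `param_sim`, `composed_sim`) are illustrative.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two arrays, compared as flat vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)

# Hypothetical toy model: f(x) = W2 @ (W1 @ x), with 8 hidden units.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

# Apply a weight-space symmetry: reverse the order of the hidden units
# and rescale them by 2, undoing the scaling in the next layer.
W1_alt = 2.0 * W1[::-1]
W2_alt = 0.5 * W2[:, ::-1]

# Basis-dependent comparison: cosine similarity of the raw parameters.
param_sim = cosine(
    np.concatenate([W1.ravel(), W2.ravel()]),
    np.concatenate([W1_alt.ravel(), W2_alt.ravel()]),
)

# Comparison after contracting the hidden dimension: both models
# compose to the same input-output matrix.
composed_sim = cosine(W2 @ W1, W2_alt @ W1_alt)

print(f"raw-parameter cosine:   {param_sim:.3f}")
print(f"composed-tensor cosine: {composed_sim:.3f}")
```

In this sketch the two parameter sets define the same function, so a comparison of the composed matrices reports them as identical, while the raw-parameter comparison does not. Contracting the whole model is only feasible here because the example is tiny and linear.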