Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and a model, MI aims to discover a succinct algorithmic process, called an interpretation, that explains the model's decisions on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and present a case study of its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on the similarity of models' representations. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for more rigorous methods for evaluating MI and for automated, generalizable methods for discovering interpretations.