Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.
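The instability described above can be illustrated with a toy simulation. The sketch below is not from the paper; all quantities (number of components, per-input noise level, circuit size) are hypothetical. It models per-input causal-effect scores as noisy draws around fixed means, then selects a "circuit" as the top-k components by dataset-mean score, showing that resampled datasets yield different circuits when per-input variance is high relative to the mean separation.

```python
# Toy sketch (illustrative only): high per-input variance in causal-effect
# scores makes top-k circuit selection unstable across resampled datasets.
import numpy as np

n_components = 50   # hypothetical model components (e.g., edges)
n_inputs = 32       # inputs per resampled dataset

# Hypothetical ground-truth mean effects: a few strong, many near zero.
true_means = np.zeros(n_components)
true_means[:5] = 1.0
noise_sd = 3.0      # high intrinsic per-input variance

def discovered_circuit(seed, k=5):
    """Estimate a circuit as the top-k components by mean patched score."""
    rng = np.random.default_rng(seed)
    scores = true_means[None, :] + rng.normal(0.0, noise_sd,
                                              (n_inputs, n_components))
    mean_scores = scores.mean(axis=0)
    return frozenset(np.argsort(mean_scores)[-k:])

# Re-running "discovery" on resampled datasets yields different circuits.
circuits = [discovered_circuit(s) for s in range(20)]
overlaps = [len(a & b) / 5 for a, b in zip(circuits, circuits[1:])]
print("mean pairwise overlap of consecutive circuits:", np.mean(overlaps))
```

With the noise level chosen here, the standard error of each mean score (noise_sd / sqrt(n_inputs) ≈ 0.53) is comparable to the separation between strong and weak components, so the recovered circuit changes from run to run, which is the qualitative behavior the abstract attributes to real circuit-discovery pipelines.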