Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model's behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.