Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits: minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the target concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Neurons whose inclusion is unstable are abstained on, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!
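The abstract does not spell out the certification procedure, but the core subsample-and-vote idea can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function names (`certified_circuit`, `discover`), the subsample size, and the stability threshold are all hypothetical choices for the sake of the example.

```python
import random

def certified_circuit(discover, dataset, n_trials=100, keep_frac=0.5,
                      stability_threshold=1.0, seed=0):
    """Wrap a black-box circuit-discovery routine with randomized
    subsampling of the concept dataset. A component is certified only if
    its inclusion decision is stable across subsampled runs; components
    with unstable inclusion are abstained on (dropped from the circuit)."""
    rng = random.Random(seed)
    k = max(1, int(keep_frac * len(dataset)))
    counts = {}
    for _ in range(n_trials):
        # Each subsample is a perturbed version of the concept dataset.
        subsample = rng.sample(dataset, k)
        for component in discover(subsample):  # black-box discovery call
            counts[component] = counts.get(component, 0) + 1
    # Keep only components included in at least `stability_threshold`
    # of the runs; everything else is considered unstable.
    return {c for c, n in counts.items() if n / n_trials >= stability_threshold}
```

As a toy usage example, a discovery routine that always returns one neuron but returns a second neuron only when a particular example happens to be sampled would see the first neuron certified and the second abstained on, mirroring the distinction the abstract draws between concept-driven and dataset-artifact components.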