Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits: minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle. The circuits they find depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components (neurons or edges of the model graph, depending on the base algorithm) are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains on unstable components, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy with up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.
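To make the subsample-and-abstain idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `subsample_vote`, the parameters `keep_prob` and `margin`, and the toy discovery routine are all hypothetical, and the fixed frequency threshold stands in for the formal certificate, which is not reproduced here. The sketch runs a black-box discovery function on random subsamples of the concept dataset, records how often each component is selected, and abstains on components whose inclusion frequency is neither consistently high nor consistently low.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Set, Tuple


def subsample_vote(
    dataset: Sequence,
    discover: Callable[[Sequence], Set[str]],
    n_rounds: int = 100,
    keep_prob: float = 0.8,   # assumed per-example keep probability
    margin: float = 0.9,      # assumed stability threshold, not a certificate
    seed: int = 0,
) -> Tuple[Set[str], Set[str]]:
    """Return (included, abstained) components from repeated subsampling."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_rounds):
        # Keep each example independently with probability keep_prob.
        subset = [x for x in dataset if rng.random() < keep_prob]
        votes.update(discover(subset))

    included, abstained = set(), set()
    for component, count in votes.items():
        freq = count / n_rounds
        if freq >= margin:
            included.add(component)      # selected in almost every round
        elif freq > 1 - margin:
            abstained.add(component)     # decision flips across subsamples
        # Otherwise the component is consistently excluded.
    return included, abstained


def toy_discover(subset: Sequence) -> Set[str]:
    """Hypothetical base algorithm: keep components active in a strict
    majority of the examples it is given."""
    counts = Counter(c for example in subset for c in example)
    return {c for c, n in counts.items() if n > len(subset) / 2}


# Toy concept dataset: each example is the set of components it activates.
# "a" is active everywhere, "c" in exactly half, "b" in a single example.
dataset = [{"a", "c"} for _ in range(10)] + [{"a"} for _ in range(10)]
dataset[10] = {"a", "b"}

included, abstained = subsample_vote(dataset, toy_discover)
print("included:", sorted(included))
print("abstained:", sorted(abstained))
```

In this toy setup, the component that is active in every example is selected in every round, while the component active in exactly half of the examples flips in and out depending on which examples are dropped, so the sketch abstains on it. A real certificate would replace the fixed `margin` with a bound tied to the allowed edit distance on the dataset.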