Certified Circuits: Stability Guarantees for Mechanistic Circuits

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits: minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle. Circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution (OOD), raising doubts about whether they capture the concept itself or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that decisions about which components to include in a circuit are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains on unstable neurons, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on a formal footing by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!
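The abstract does not spell out the certification procedure, so the following is only a rough sketch of the general idea of wrapping a black-box discovery algorithm with random subsampling and abstaining on unstable components. Everything in it is an assumption made for illustration: the function names `stable_circuit` and `toy_discover`, the majority threshold of 0.5, and the use of a Hoeffding confidence interval on each component's inclusion frequency. It is not the authors' method and does not by itself yield the edit-distance guarantee described above.

```python
import math
import random
from typing import Callable, Dict, Hashable, Sequence, Set


def stable_circuit(
    dataset: Sequence,
    discover: Callable[[Sequence], Set[Hashable]],
    n_rounds: int = 200,
    subsample_size: int = 0,
    delta: float = 0.05,
    seed: int = 0,
) -> Dict[Hashable, str]:
    """Label each discovered component 'include', 'exclude', or 'abstain'.

    Hypothetical wrapper: `discover` is treated as a black box that maps a
    dataset subsample to a set of component ids (e.g. neuron indices).
    """
    rng = random.Random(seed)
    k = subsample_size or max(1, len(dataset) // 2)
    items = list(dataset)

    # Count how often each component is selected across random subsamples.
    counts: Dict[Hashable, int] = {}
    for _ in range(n_rounds):
        subset = rng.sample(items, k)  # sampling without replacement
        for component in discover(subset):
            counts[component] = counts.get(component, 0) + 1

    # Hoeffding half-width: each true inclusion frequency lies within
    # +/- eps of its estimate with probability at least 1 - delta.
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_rounds))

    decisions: Dict[Hashable, str] = {}
    for component, n in counts.items():
        freq = n / n_rounds
        if freq - eps > 0.5:
            decisions[component] = "include"
        elif freq + eps < 0.5:
            decisions[component] = "exclude"
        else:
            # The interval straddles the threshold: the decision is unstable.
            decisions[component] = "abstain"
    return decisions


def toy_discover(subset: Sequence[int]) -> Set[str]:
    """Toy stand-in for a discovery algorithm (assumption, for the demo only)."""
    found = {"always"}                 # selected regardless of the subsample
    if 0 in subset:
        found.add("flaky")             # hinges on a single sample being present
    if 0 in subset and 1 in subset:
        found.add("rare")              # needs two specific samples at once
    return found


if __name__ == "__main__":
    print(stable_circuit(range(20), toy_discover))
```

In this toy setup, a component whose selection hinges on one particular sample is the kind of dataset-specific artifact the wrapper is meant to flag, while a component selected on every subsample is kept. Components that are never selected simply do not appear in the result.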

