Certified Circuits: Stability Guarantees for Mechanistic Circuits

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits, i.e., minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: the circuits they return depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution (OOD), raising doubts about whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components (neurons or edges of the model graph, depending on the base algorithm) are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains on unstable components, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy with up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.
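The wrap-and-abstain idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or their certificate: the function `vote_on_components`, the toy `toy_discover` algorithm, and the parameters `margin`, `alpha`, `n_runs`, and `subsample_size` are all hypothetical names introduced here. In particular, `margin` is only a stand-in for whatever slack the paper's edit-distance bound requires, and the Hoeffding confidence interval is a generic substitute chosen to keep the example dependency-free.

```python
import math
import random
from typing import Callable, Dict, Hashable, Sequence, Set

INCLUDE, EXCLUDE, ABSTAIN = "include", "exclude", "abstain"


def vote_on_components(
    dataset: Sequence,
    discover: Callable[[Sequence], Set[Hashable]],
    components: Sequence[Hashable],
    n_runs: int = 200,
    subsample_size: int = 20,
    margin: float = 0.1,   # assumed slack; NOT the paper's edit-distance bound
    alpha: float = 0.01,   # allowed failure probability of the intervals
    seed: int = 0,
) -> Dict[Hashable, str]:
    """Run a black-box discovery routine on random subsamples and keep only
    the inclusion decisions that are stable across subsamples."""
    rng = random.Random(seed)
    pool = list(dataset)
    counts = {c: 0 for c in components}

    for _ in range(n_runs):
        subset = rng.sample(pool, subsample_size)  # without replacement
        found = discover(subset)
        for c in components:
            if c in found:
                counts[c] += 1

    # Hoeffding half-width, with a union bound over all components.
    eps = math.sqrt(math.log(2 * len(components) / alpha) / (2 * n_runs))

    decisions = {}
    for c in components:
        p_hat = counts[c] / n_runs  # empirical inclusion frequency
        if p_hat - eps > 0.5 + margin:
            decisions[c] = INCLUDE
        elif p_hat + eps < 0.5 - margin:
            decisions[c] = EXCLUDE
        else:
            decisions[c] = ABSTAIN  # inclusion depends too much on the data
    return decisions


def toy_discover(subset: Sequence[int]) -> Set[str]:
    """Hypothetical base algorithm: 'a' is always found, 'c' never,
    and 'b' flips with an arbitrary property of the subsample."""
    found = {"a"}
    if sum(subset) % 2 == 0:
        found.add("b")
    return found


decisions = vote_on_components(range(100), toy_discover, ["a", "b", "c"])
print(decisions)
```

In this toy setting, component `a` is found on every subsample and `c` on none, so both receive a firm decision, while `b` appears on roughly half of the subsamples and is abstained from. The sketch only conveys the structure of the approach (subsample, rerun, aggregate, abstain); the formal guarantee against bounded edits of the concept dataset is the paper's contribution and is not reproduced here.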
