*Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with *provable guarantees*. We focus on three types of guarantees: (1) *input domain robustness*, ensuring the circuit agrees with the model across a continuous input region; (2) *robust patching*, certifying circuit alignment under continuous patching perturbations; and (3) *minimality*, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.
翻译:*自动化电路发现*是机制可解释性中的核心工具,用于识别神经网络中对特定行为负责的内部组件。尽管先前方法已取得显著进展,但它们通常依赖于启发式方法或近似技术,且无法对所得电路在连续输入域上提供可证明的保证。本工作中,我们利用神经网络验证的最新进展,提出了一套能够产生具备*可证明保证*的自动化算法。我们重点关注三类保证:(1) *输入域鲁棒性*,确保电路在连续输入区域内与模型行为一致;(2) *鲁棒修补*,证明电路在连续修补扰动下的对齐性;(3) *极小性*,形式化并涵盖多种简洁性概念。值得注意的是,我们揭示了这三类保证之间一系列新颖的理论联系,这对我们算法的收敛性具有关键意义。最后,我们在多种视觉模型上使用前沿验证器进行实验,结果表明我们的算法产生的电路比标准电路发现方法具有显著更强的鲁棒性保证,从而为可证明的电路发现奠定了理论基础。