Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.