Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction -- a simpler, high-level Structural Causal Model (SCM) faithful to the network under interventions. Discovering such abstractions is hard: it typically demands brute-force interchange interventions or retraining. We reframe the problem by viewing structured pruning as a search over approximate abstractions. Treating a trained network as a deterministic SCM, we derive an Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, our score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting procedure efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, which we validate via interchange interventions.
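The uniform-curvature special case can be sketched concretely: under that assumption, the second-order interventional-risk increase from replacing a hidden unit with a constant is proportional to its activation variance, so variance ranking decides which units to constant-fold. A minimal illustration (function and variable names are ours, not the paper's; the calibration-set setup is a hypothetical example):

```python
import numpy as np

def constant_replacement_scores(activations):
    """Score each unit for replacement by its mean activation.

    Under the uniform-curvature assumption, the second-order
    interventional-risk increase of constant-folding unit j is
    proportional to its activation variance. `activations` is an
    (n_samples, n_units) array of hidden activations collected on a
    calibration set (illustrative setup, not the paper's pipeline).
    """
    means = activations.mean(axis=0)          # constants to substitute
    scores = activations.var(axis=0)          # low variance => cheap to fold
    return scores, means

# Toy example: unit 0 is nearly constant, unit 1 varies widely.
acts = np.array([[1.0, -3.0],
                 [1.1,  2.0],
                 [0.9,  5.0]])
scores, means = constant_replacement_scores(acts)
order = np.argsort(scores)  # prune lowest-variance units first
```

When curvature is non-uniform, this ranking can misorder units, which is the failure mode of plain variance-based pruning that the general criterion is meant to correct.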