Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
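The core idea of contribution decomposition can be illustrated with a minimal sketch. For a one-hidden-layer ReLU network with a scalar output, each hidden neuron's contribution is simply its activation times its output weight, so the contributions sum exactly to the output; a sparse autoencoder fit to these contribution vectors then yields sparse modes. All names, sizes, and the training recipe below are illustrative assumptions, not the paper's actual architecture or procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer network (weights are random for illustration).
n_in, n_hidden = 8, 16
W1 = rng.normal(size=(n_hidden, n_in))
w2 = rng.normal(size=n_hidden)

def contributions(x):
    """Per-neuron contributions to the scalar output.

    The output is sum_j w2[j] * relu(W1 @ x)[j], so each term w2[j] * h[j]
    is one hidden neuron's additive contribution.
    """
    h = np.maximum(W1 @ x, 0.0)   # hidden activations
    return w2 * h                 # elementwise: one term per neuron

# Collect contribution vectors over a batch of inputs.
X = rng.normal(size=(256, n_in))
C = np.stack([contributions(x) for x in X])        # shape (256, n_hidden)

# Sanity check: contributions sum exactly to the network output.
outputs = np.array([w2 @ np.maximum(W1 @ x, 0.0) for x in X])
assert np.allclose(C.sum(axis=1), outputs)

# Minimal sparse autoencoder on the contribution vectors: a ReLU encoder,
# linear decoder, and an L1 penalty on the codes, trained by plain gradient
# descent. This is a generic SAE sketch, not CODEC's exact formulation.
n_modes = 32
D = rng.normal(size=(n_hidden, n_modes)) * 0.1     # decoder dictionary
E = D.T.copy()                                     # encoder (initially tied)
lr, l1 = 1e-2, 1e-3
for _ in range(200):
    A = np.maximum(C @ E.T, 0.0)                   # sparse codes, (256, n_modes)
    R = A @ D.T                                    # reconstruction, (256, n_hidden)
    err = R - C
    # Gradients of 0.5*||R - C||^2 + l1*sum(A), backpropagated through ReLU.
    gD = err.T @ A / len(C)
    dA = (err @ D + l1) * (A > 0)
    gE = dA.T @ C / len(C)
    D -= lr * gD
    E -= lr * gE

modes = np.maximum(C @ E.T, 0.0)   # sparse contribution modes per input
```

Each row of `modes` expresses one input's hidden-neuron contributions as a nonnegative, sparse combination of learned dictionary columns, which is the unit of analysis the abstract calls a contribution mode.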