Post-hoc explainability methods typically associate each output score of a deep neural network with an input-space direction, most commonly instantiated as the gradient and visualized as a saliency map. However, these approaches often yield explanations that are noisy and lack perceptual alignment, and thus offer limited interpretability. While many explanation methods attempt to address this issue via modified backward rules or additional heuristics, such approaches are often difficult to justify theoretically and frequently fail basic sanity checks. We introduce Semantic Pullbacks (SP), a faithful and effective post-hoc explanation method for deep neural networks. Semantic Pullbacks address these limitations by isolating the network's effective linear action via a principled pullback formulation and refining it to recover coherent local structures learned by the target neuron. As a result, SP produces perceptually aligned, class-conditional explanations that highlight meaningful features, support compelling counterfactual perturbations, and admit a clear theoretical motivation. Across standard faithfulness benchmarks, Semantic Pullbacks significantly outperform established attribution methods on both classical convolutional architectures (ResNet50, VGG) and transformer-based models (PVT), while remaining general and computationally efficient. Our method can be easily plugged into existing deep learning pipelines and extended to other modalities.