As model explanations, neural pathways are sparse sets of neurons that match the prediction performance of the whole model. Existing methods primarily focus on accuracy and sparsity, but the pathways they generate may offer limited interpretability and thus fall short of explaining model behavior. In this paper, we propose two interpretability criteria for neural pathways: (i) same-class neural pathways should consist primarily of class-relevant neurons; (ii) the sparsity of each instance's neural pathway should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict neural pathways from the target model's feature maps. GEN-CNP learns class-relevant information from the features of both deep and shallow layers, so that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion so that GEN-CNP generates pathways with instance-specific sparsity. Finally, we transfer the class-relevant neural pathways to explain other samples of the same class and show, experimentally and qualitatively, that they are faithful and interpretable.
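To make the notion of a neural pathway concrete, the following is a minimal sketch, not the GEN-CNP model. It assumes a toy two-layer ReLU network with random weights and treats a pathway as a binary mask over hidden neurons. The mask is grown by a simple greedy heuristic (a stand-in chosen for illustration) until the masked network reproduces the full model's prediction, so the number of kept neurons varies per instance. All function names (`forward`, `greedy_pathway`, `kept_fraction`, `jaccard`) are hypothetical.

```python
import random


def hidden_layer(x, W1):
    """ReLU activations of the hidden layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def forward(x, W1, W2, mask=None):
    """Logits of a two-layer MLP; `mask` (0/1 per hidden neuron) selects the pathway."""
    hidden = hidden_layer(x, W1)
    if mask is not None:
        hidden = [h * m for h, m in zip(hidden, mask)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def predict(logits):
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(logits)), key=lambda k: logits[k])


def greedy_pathway(x, W1, W2):
    """Add hidden neurons, most helpful for the predicted class first,
    until the masked model agrees with the full model (illustrative heuristic)."""
    hidden = hidden_layer(x, W1)
    target = predict(forward(x, W1, W2))
    contribution = [W2[target][j] * hidden[j] for j in range(len(hidden))]
    order = sorted(range(len(hidden)), key=lambda j: contribution[j], reverse=True)
    mask = [0] * len(hidden)
    for j in order:
        mask[j] = 1
        if predict(forward(x, W1, W2, mask)) == target:
            break
    return mask


def kept_fraction(mask):
    """Share of hidden neurons kept in the pathway (lower means sparser)."""
    return sum(mask) / len(mask)


def jaccard(mask_a, mask_b):
    """Overlap between two pathways: |A and B| / |A or B|."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0


# Toy setup (assumed sizes): 4 inputs, 16 hidden neurons, 3 classes.
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
W2 = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(3)]
x_a = [rng.uniform(-1, 1) for _ in range(4)]
x_b = [rng.uniform(-1, 1) for _ in range(4)]

mask_a = greedy_pathway(x_a, W1, W2)
mask_b = greedy_pathway(x_b, W1, W2)
print("kept fraction:", kept_fraction(mask_a), kept_fraction(mask_b))
print("pathway overlap (Jaccard):", jaccard(mask_a, mask_b))
```

In this sketch, agreement between the masked and full predictions plays the role of faithfulness, the kept fraction plays the role of instance-specific sparsity, and the Jaccard overlap is one simple way to compare the pathways of two samples. The similarity measure actually used for same-class pathways is not specified in the abstract.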