Neural network models are widely used in a variety of domains, often as black-box solutions, since they are not directly interpretable by humans. The field of explainable artificial intelligence aims to develop explanation methods that address this challenge, and several approaches have been developed in recent years, including methods for investigating what type of knowledge these models internalise during the training process. Among these, the method of concept detection investigates which \emph{concepts} neural network models learn to represent in order to complete their tasks. In this work, we present an extension to the method of concept detection, named \emph{concept backpropagation}, which provides a way of analysing how the information representing a given concept is internalised in a given neural network model. In this approach, the model input is perturbed in a manner guided by a concept probe trained on the model, such that the probe's output for the concept of interest is maximised. This allows the detected concept to be visualised directly in the input space of the model, which in turn makes it possible to see what information the model depends on for representing that concept. We present results for this method applied to a varied set of input modalities, and discuss how our proposed method can be used to visualise what information trained concept probes use, and the degree to which the representation of the probed concept is entangled within the neural network model itself.
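The core idea of perturbing the input so that a trained probe's output is maximised can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from this work: the "model" is a single tanh layer with hand-picked weights `W`, the "probe" is a linear readout `(w, b)` on the hidden activations, and the function names (`hidden`, `probe_logit`, `concept_backprop`) are hypothetical. The sketch uses plain gradient ascent on the probe logit and omits any regularisation, masking, or constraint on the perturbation that a full method may require.

```python
import math

def hidden(W, x):
    """Toy 'model': one tanh layer, returning the activations the probe reads."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def probe_logit(w, b, h):
    """Toy linear concept probe on the hidden activations."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def logit_grad_wrt_input(W, w, x):
    """Gradient of the probe logit with respect to the input, by the chain rule:
    d logit / d x_j = sum_i w_i * (1 - h_i^2) * W_ij."""
    h = hidden(W, x)
    return [
        sum(w[i] * (1.0 - h[i] ** 2) * W[i][j] for i in range(len(W)))
        for j in range(len(x))
    ]

def concept_backprop(W, w, b, x, steps=50, lr=0.1):
    """Gradient ascent in input space: nudge x so the probe logit increases."""
    x = list(x)
    for _ in range(steps):
        g = logit_grad_wrt_input(W, w, x)
        x = [xj + lr * gj for xj, gj in zip(x, g)]
    return x

# Hand-picked toy weights (assumptions, not learned from any data).
W = [[1.0, -0.5, 0.0],
     [0.3, 0.8, -0.2]]
w, b = [1.5, -1.0], 0.1

x0 = [0.0, 0.0, 0.0]
x_star = concept_backprop(W, w, b, x0)

logit_before = probe_logit(w, b, hidden(W, x0))
logit_after = probe_logit(w, b, hidden(W, x_star))
delta = [xs - xi for xs, xi in zip(x_star, x0)]  # perturbation to inspect
```

In this toy setting, `delta` is the quantity one would visualise: it shows which input dimensions had to change, and in which direction, for the probe to report more of the concept. For a real network the gradient would typically come from an automatic differentiation framework rather than a hand-derived formula.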