Recent interpretability methods propose using concept-based explanations to translate the internal representations of deep learning models into a language that humans are familiar with: concepts. This requires understanding which concepts are present in the representation space of a neural network. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs: they may be (1) inconsistent between layers, (2) entangled with different concepts, and (3) spatially dependent. Each property presents both challenges and opportunities for interpreting models. We introduce tools to detect these properties, show how they affect the explanations derived from CAVs, and offer recommendations for minimising their impact. Understanding these properties can also be turned to our advantage: for example, we introduce spatially dependent CAVs to test whether a model is translation invariant with respect to a specific concept and class. Our experiments are performed on ImageNet and on Elements, a new synthetic dataset designed to capture known ground-truth relationships between concepts and classes. We release this dataset to facilitate further research into understanding and evaluating interpretability methods.
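As background, the standard CAV recipe (Kim et al., 2018) fits a linear classifier in a layer's activation space to separate concept exemplars from random images, and takes the vector normal to the resulting decision boundary as the CAV. The sketch below illustrates this recipe under stated assumptions; it is not the authors' implementation, and the names `learn_cav`, `concept_acts`, and `random_acts` are hypothetical placeholders for precomputed activations.

```python
# Minimal sketch of learning a Concept Activation Vector (CAV).
# Assumes activations at a chosen layer have already been extracted;
# `concept_acts` and `random_acts` are placeholder arrays of shape (n, d).
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear probe separating concept exemplars from random images
    and return the unit-norm normal to the separating hyperplane (the CAV)."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_.ravel()          # normal to the decision boundary
    return cav / np.linalg.norm(cav)

# Toy usage with synthetic activations (d = 512):
rng = np.random.default_rng(0)
cav = learn_cav(rng.normal(1.0, 1.0, (50, 512)),
                rng.normal(0.0, 1.0, (50, 512)))
```

Because the CAV is tied to one layer's activation space, repeating this procedure at different layers can yield different vectors, which is the layer-consistency question the paper examines.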