Deep Neural Networks (DNNs) demonstrate remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, are limited by their reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach that connects learned representations with human-understandable concepts by leveraging the representations' capacity to discriminate between these concepts. In contrast to prior work, INVERT handles diverse types of neurons, has lower computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric that assesses the alignment between a representation and its corresponding explanation, together with a measure of statistical significance. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations and the interpretation of the hierarchical structure of decision-making within the models.
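To make the idea of scoring a representation by its capacity to discriminate a concept more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the alignment metric or the significance test, so two assumptions are made here for illustration only: the alignment score is taken to be the area under the ROC curve (AUC) of a neuron's activations with respect to binary concept labels, and significance is estimated with a simple label-permutation test. The function names `auc` and `permutation_p_value` and the toy data are hypothetical.

```python
import random


def auc(activations, labels):
    """Assumed alignment score: probability that a concept-positive input
    activates the neuron more strongly than a concept-negative input.
    Ties count as half a win. 0.5 means no discrimination, 1.0 is perfect."""
    pos = [a for a, y in zip(activations, labels) if y]
    neg = [a for a, y in zip(activations, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(activations, labels, n_permutations=2000, seed=0):
    """Assumed significance test: fraction of random label shufflings whose
    AUC is at least as high as the observed one (with add-one smoothing)."""
    rng = random.Random(seed)
    observed = auc(activations, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(activations, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: one neuron's activations on ten inputs, and whether
# each input contains the candidate concept (1) or not (0).
activations = [0.9, 0.8, 0.7, 0.85, 0.2, 0.1, 0.3, 0.75, 0.15, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

score = auc(activations, labels)
p_value = permutation_p_value(activations, labels)
print(f"AUC = {score:.3f}, permutation p-value = {p_value:.4f}")
```

Under these assumptions, a score of this kind needs only image-level concept labels rather than segmentation masks, and it applies to any neuron that yields a scalar activation per input. The pairwise loop is quadratic in the number of inputs and is written for readability; a rank-based computation would be preferable at scale.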