An important line of research attempts to explain CNN image classifier predictions and intermediate layer representations in terms of human-understandable concepts. In this work, we build on previous work that uses annotated concept datasets to extract interpretable feature space directions, and we propose an unsupervised post-hoc method that extracts a disentangling interpretable basis by searching for the rotation of the feature space that explains sparse one-hot thresholded transformed representations of pixel activations. We experiment with popular existing CNNs and demonstrate the effectiveness of our method in extracting an interpretable basis across network architectures and training datasets. We extend the existing basis interpretability metrics from the literature and show that intermediate layer representations become more interpretable when transformed to the bases extracted with our method. Finally, using these metrics, we compare the bases extracted with our method to bases derived with a supervised approach, find that in one aspect the proposed unsupervised approach has a strength where the supervised one has a limitation, and give potential directions for future research.
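To make the idea of "one-hot thresholded transformed representations" concrete, the sketch below scores a candidate rotation by the share of pixels whose rotated, thresholded activations have exactly one active coordinate. It is only an illustration under stated assumptions: the abstract does not give the objective, the thresholding scheme, or the optimisation procedure, so the function `one_hot_fraction`, the single scalar threshold, and the synthetic data are all hypothetical. The sketch evaluates a given basis; it does not search for one.

```python
import numpy as np


def one_hot_fraction(features, basis, threshold):
    """Share of pixels whose thresholded coordinates in `basis` are exactly one-hot.

    features:  (n_pixels, dim) pixel activations from some intermediate layer.
    basis:     (dim, dim) orthogonal matrix whose columns are candidate directions.
    threshold: scalar cut-off applied to every coordinate (an assumption here).
    """
    projected = features @ basis      # coordinates of each pixel in the candidate basis
    active = projected > threshold    # binarise each coordinate
    return float(np.mean(active.sum(axis=1) == 1))


# Synthetic stand-in for pixel activations (not real CNN features): each pixel
# expresses one hidden "concept", and the concepts lie along the columns of a
# known 4-D rotation (a scaled Hadamard matrix, which is orthogonal).
rng = np.random.default_rng(0)
hidden_basis = 0.5 * np.array(
    [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]],
    dtype=float,
)
labels = rng.integers(0, 4, size=1000)
features = hidden_basis[:, labels].T + 0.02 * rng.standard_normal((1000, 4))

aligned = one_hot_fraction(features, hidden_basis, threshold=0.25)
unrotated = one_hot_fraction(features, np.eye(4), threshold=0.25)
print(f"one-hot fraction in the hidden basis:  {aligned:.3f}")
print(f"one-hot fraction in the original axes: {unrotated:.3f}")
```

In this toy setting the hidden rotation yields far more one-hot pixels than the original axes, which is the kind of signal an unsupervised search over rotations could plausibly exploit.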