We propose Automatic Feature Explanation using Contrasting Concepts (FALCON), an interpretability framework for explaining features of image representations. For a target feature, FALCON captions its highly activating cropped images using a large captioning dataset (such as LAION-400m) and a pre-trained vision-language model (such as CLIP). Each word in the captions is scored and ranked, yielding a small number of shared, human-understandable concepts that closely describe the target feature. FALCON also applies contrastive interpretation using lowly activating (counterfactual) images to eliminate spurious concepts. Although many existing approaches interpret features independently, we observe that in state-of-the-art self-supervised and supervised models, less than 20% of the representation space can be explained by individual features. We show that features in larger spaces become more interpretable when studied in groups and can be explained with high-order scoring concepts through FALCON. We discuss how the extracted concepts can be used to explain and debug failures in downstream tasks. Finally, we present a technique to transfer concepts from one (explainable) representation space to another, unseen representation space by learning a simple linear transformation. Code is available at https://github.com/NehaKalibhat/falcon-explain.
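To make the word-scoring and contrastive steps concrete, the following is a minimal sketch and not the FALCON implementation. It assumes captions have already been retrieved for the highly and lowly activating images (the CLIP-based retrieval over a caption dataset is omitted). The scoring rule (fraction of captions containing a word), the stopword list, and the names `tokenize`, `word_scores`, and `contrastive_concepts` are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "with", "at"}


def tokenize(caption):
    """Return the set of lowercase, alphabetic, non-stopword tokens."""
    words = (w.strip(".,!?").lower() for w in caption.split())
    return {w for w in words if w.isalpha() and w not in STOPWORDS}


def word_scores(captions):
    """Assumed scoring rule: fraction of captions that contain each word."""
    counts = Counter(w for c in captions for w in tokenize(c))
    return {w: n / len(captions) for w, n in counts.items()}


def contrastive_concepts(high_captions, low_captions, top_k=3):
    """Rank words from highly activating captions, dropping any word that
    also occurs in captions of lowly activating (counterfactual) images."""
    high = word_scores(high_captions)
    low = word_scores(low_captions)
    kept = {w: s for w, s in high.items() if w not in low}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(kept, key=lambda w: (-kept[w], w))
    return ranked[:top_k]


# Toy captions standing in for retrieved captions (illustrative data only).
high_captions = [
    "a yellow taxi on a street",
    "yellow taxi parked on the street",
    "a yellow school bus on a street",
]
low_captions = [
    "a busy street at night",
    "people walking on a street",
]

print(contrastive_concepts(high_captions, low_captions, top_k=2))
```

In this toy example, "street" appears in every highly activating caption but also in the counterfactual captions, so the contrastive step discards it as spurious and the remaining top-ranked words describe the feature.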