Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet their learned representations remain opaque, making their decision-making difficult to interpret. Recent work decomposes these representations into human-interpretable concepts, but provides poor spatial grounding and is limited to image classification. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts that are both human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, CFM yields explanations for any of that model's downstream tasks. Examining local co-occurrence dependencies of concepts allows us to define concept relationships, through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM achieves performance on classification, segmentation, and captioning competitive with opaque foundation models while providing fine-grained, high-quality concept-based explanations. Code is available at https://github.com/kawi19/CFM.