Language-aligned vision foundation models perform strongly across diverse downstream tasks, yet their learned representations remain opaque, making their decision-making hard to interpret. Recent works decompose these representations into human-interpretable concepts but provide poor spatial grounding and are limited to image classification. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at multiple granularities. Examining local co-occurrence dependencies among concepts allows us to define concept relationships; through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight achieves classification and segmentation performance competitive with opaque foundation models while providing fine-grained, high-quality concept-based explanations. Code is available at https://github.com/kawi19/Insight.
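The abstract does not specify the hierarchical sparse autoencoder's architecture, but the core idea of decomposing a dense feature vector into an overcomplete, mostly-zero concept activation vector can be sketched as below. This is a minimal illustration, not the paper's model: the dimensions, weight names, and the ReLU-plus-L1 sparsity scheme are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper): a 16-d backbone
# feature is expanded into 64 candidate concept activations.
d_model, d_concepts = 16, 64

W_enc = rng.normal(scale=0.1, size=(d_model, d_concepts))
b_enc = np.zeros(d_concepts)
W_dec = rng.normal(scale=0.1, size=(d_concepts, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps concept activations non-negative; during training an
    # L1 penalty on these activations (omitted here) would drive most
    # of them to zero, yielding a sparse concept code.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(z):
    # Reconstruct the original feature from the sparse concept code.
    return z @ W_dec + b_dec

x = rng.normal(size=d_model)   # one dense feature vector
z = encode(x)                  # sparse concept activations
x_hat = decode(z)              # reconstruction

print(z.shape, x_hat.shape)
```

Applying such an encoder per spatial location of a vision backbone's feature map is what would make each concept spatially grounded, since every activation is tied to an image position.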