Concept Bottleneck Models (CBMs) have recently been proposed to address the 'black-box' problem of deep neural networks, by first mapping images to a human-understandable concept space and then linearly combining concepts for classification. Such models typically require first curating a set of concepts relevant to the task and then aligning the representations of a feature extractor to map to these concepts. However, even with powerful foundational feature extractors like CLIP, there are no guarantees that the specified concepts are detectable. In this work, we leverage recent advances in mechanistic interpretability and propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm: instead of pre-selecting concepts based on the downstream classification task, we use sparse autoencoders to first discover concepts learnt by the model, and then name them and train linear probes for classification. Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model. We perform a comprehensive evaluation across multiple datasets and CLIP architectures and show that our method yields semantically meaningful concepts, assigns appropriate names to them that make them easy to interpret, and yields performant and interpretable CBMs. Code is available at https://github.com/neuroexplicit-saar/discover-then-name.
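The discover-then-name pipeline above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: dimensions, initialisation, and the sparsity weight are assumptions made for the example, and in practice the sparse autoencoder is trained on frozen CLIP features before the linear probe is fit over its concept activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d = CLIP feature dimension, k = (overcomplete) concept
# dictionary size, n_classes = number of downstream labels.
d, k, n_classes = 512, 2048, 10

# Sparse autoencoder parameters (randomly initialised here; learnt in practice).
W_enc, b_enc = rng.normal(0, 0.02, (k, d)), np.zeros(k)
W_dec, b_dec = rng.normal(0, 0.02, (d, k)), np.zeros(d)

def sae_encode(f):
    """Map a frozen CLIP feature vector to a non-negative, sparse concept code."""
    return np.maximum(W_enc @ f + b_enc, 0.0)  # ReLU keeps activations >= 0

def sae_decode(z):
    """Reconstruct the CLIP feature from the concept code."""
    return W_dec @ z + b_dec

def sae_loss(f, lam=3e-4):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    z = sae_encode(f)
    return np.sum((f - sae_decode(z)) ** 2) + lam * np.sum(np.abs(z)), z

f = rng.normal(0.0, 1.0, d)   # stand-in for one CLIP image feature
loss, z = sae_loss(f)

# Downstream CBM: a linear probe over the concept activations gives class
# scores, so each prediction decomposes into concept activation x weight.
W_probe = rng.normal(0, 0.02, (n_classes, k))
logits = W_probe @ z
```

Because the probe is linear in the (named) concept activations, a prediction can be read off as a weighted sum of interpretable concept contributions.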