Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts that are not compositional. To discover compositional concept representations automatically, we identify two salient properties of such representations and propose Compositional Concept Extraction (CCE), which finds concepts that obey these properties. We evaluate CCE on five datasets spanning image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at https://github.com/adaminsky/compositional_concepts.
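To make the notion of compositionality concrete, the following is a minimal sketch, not the CCE method itself. It assumes that concepts compose additively in embedding space and scores a sample by the cosine similarity between its embedding and the normalized sum of its concept vectors. The concept names, the synthetic embeddings, and the helper functions `compose` and `compositionality_score` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def compose(concepts, names):
    """Sum the named concept vectors (additive composition is an assumption)
    and normalize the result to unit length."""
    v = np.sum([concepts[n] for n in names], axis=0)
    return v / np.linalg.norm(v)


def compositionality_score(embedding, concepts, names):
    """Cosine similarity between a sample embedding and its composed concepts."""
    e = embedding / np.linalg.norm(embedding)
    return float(e @ compose(concepts, names))


# Synthetic stand-ins for concept vectors; a real setup would extract
# these from a foundation model's embeddings.
rng = np.random.default_rng(0)
dim = 64
concepts = {name: rng.normal(size=dim) for name in ["red", "blue", "cube", "sphere"]}

# A synthetic "red cube" sample: the sum of its two concepts plus small noise.
sample = concepts["red"] + concepts["cube"] + 0.05 * rng.normal(size=dim)

good = compositionality_score(sample, concepts, ["red", "cube"])
bad = compositionality_score(sample, concepts, ["blue", "sphere"])
print(f"correct concepts: {good:.3f}, wrong concepts: {bad:.3f}")
```

Under this assumption, a compositional set of concepts gives a score close to 1 for the concepts actually present in the sample and a much lower score for concepts that are absent.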