Interest in understanding and factorizing learned embedding spaces through conceptual explanations is steadily growing. When no human concept labels are available, concept discovery methods search trained embedding spaces for interpretable concepts, such as object shape or color, that can be used to provide post-hoc explanations for decisions. Unlike previous work, we argue that concept discovery should be identifiable, meaning that a number of known concepts can be provably recovered to guarantee reliability of the explanations. As a starting point, we make explicit the connection between concept discovery and classical methods like Principal Component Analysis and Independent Component Analysis by showing that these methods can recover independent concepts with non-Gaussian distributions. For dependent concepts, we propose two novel approaches that exploit functional compositionality properties of image-generating processes. Our provably identifiable concept discovery methods substantially outperform competitors on a battery of experiments including hundreds of trained models and dependent concepts, where they exhibit up to 29% better alignment with the ground truth. Our results provide a rigorous foundation for reliable concept discovery without human labels.
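To illustrate the claim that classical methods can recover independent, non-Gaussian concepts, the following is a minimal sketch rather than the method evaluated in this work. It assumes a toy setting in which two hypothetical concepts are linearly entangled in a two-dimensional embedding. The helper names (`whiten`, `fast_ica`, `concept_alignment`), the mixing matrix, and the choice of a symmetric FastICA variant with a tanh nonlinearity are all assumptions made for illustration.

```python
import numpy as np


def whiten(x):
    """Center x (n_samples, n_dims) and rescale it to identity covariance (a PCA-style step)."""
    x = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    return x @ eigvecs / np.sqrt(eigvals)


def fast_ica(x, n_iter=200, seed=0):
    """Symmetric FastICA with a tanh nonlinearity; returns components of shape (n_samples, n_dims)."""
    z = whiten(x)
    n, d = z.shape
    rng = np.random.default_rng(seed)
    w = np.linalg.qr(rng.normal(size=(d, d)))[0]  # random orthogonal initialization
    for _ in range(n_iter):
        y = z @ w.T  # each row of w is one unmixing direction
        g = np.tanh(y)
        g_prime = 1.0 - g ** 2
        # Fixed-point update per direction: E[z g(w^T z)] - E[g'(w^T z)] w
        w_new = (g.T @ z) / n - np.diag(g_prime.mean(axis=0)) @ w
        # Symmetric decorrelation: project back onto orthogonal matrices
        u, _, vt = np.linalg.svd(w_new)
        w = u @ vt
    return z @ w.T


def concept_alignment(true, est):
    """For each true concept, the best absolute correlation with any estimated component."""
    d = true.shape[1]
    corr = np.corrcoef(true, est, rowvar=False)[:d, d:]
    return np.abs(corr).max(axis=1)


# Toy data (assumed): two independent, unit-variance, non-Gaussian concepts.
rng = np.random.default_rng(0)
n = 5000
concepts = np.column_stack([
    rng.laplace(scale=1 / np.sqrt(2), size=n),         # heavy-tailed concept
    rng.uniform(-np.sqrt(3), np.sqrt(3), size=n),      # bounded concept
])
mixing = np.array([[1.0, 0.5], [0.5, 1.0]])            # hypothetical linear entanglement
embeddings = concepts @ mixing.T

recovered = fast_ica(embeddings)
scores = concept_alignment(concepts, recovered)
print("alignment per concept:", scores)
```

In this sketch, recovery is only meaningful up to permutation and sign of the components, which is why the alignment score uses the best absolute correlation per concept. The setting covers independent concepts only; dependent concepts require the additional structure discussed in the abstract.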