Concept-based learning improves a deep learning model's interpretability by explaining its predictions via human-understandable concepts. Deep learning models trained under this paradigm rely heavily on the assumption that neural networks can learn to predict the presence or absence of a given concept independently of other concepts. Recent work, however, strongly suggests that this assumption may fail to hold in Concept Bottleneck Models (CBMs), a quintessential family of concept-based interpretable architectures. In this paper, we investigate whether CBMs correctly capture the degree of conditional independence across concepts when such concepts are localised both spatially, by having their values entirely defined by a fixed subset of features, and semantically, by having their values correlated with only a fixed subset of predefined concepts. To understand locality, we analyse how changes to features outside a concept's spatial or semantic locality impact that concept's predictions. Our results suggest that even in well-defined scenarios where a concept's presence is localised to a fixed feature subspace, or where its semantics are correlated with only a small subset of other concepts, CBMs fail to learn this locality. These results cast doubt upon the quality of concept representations learnt by CBMs and strongly suggest that concept-based explanations may be fragile to changes outside their localities.
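The spatial-locality analysis described above (changing features outside a concept's locality and observing how its prediction moves) can be illustrated with a minimal sketch. This is not the paper's metric or code: the function name `locality_leakage`, the Gaussian perturbation, the mean-absolute-change score, and the two linear-sigmoid "concept predictors" are all assumptions chosen for illustration, and the predictors are hand-written stand-ins rather than trained CBMs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def locality_leakage(predict_concept, x, locality_mask, rng, noise_scale=1.0):
    """Mean absolute change in a concept's predicted probability when ONLY
    features outside its locality are perturbed (hypothetical score).

    predict_concept: callable mapping (n, d) inputs to (n,) probabilities.
    locality_mask:   boolean (d,) array, True for features that define the concept.
    """
    outside = ~locality_mask
    x_perturbed = x.copy()
    # Add Gaussian noise to out-of-locality features; in-locality ones are untouched.
    noise = rng.normal(size=(x.shape[0], int(outside.sum())))
    x_perturbed[:, outside] += noise_scale * noise
    delta = predict_concept(x_perturbed) - predict_concept(x)
    return float(np.mean(np.abs(delta)))


# Toy setup (assumed): 6 input features, the concept is defined by the first 3.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=(256, 6))
mask = np.array([True, True, True, False, False, False])

# A predictor that respects locality (zero weight outside the mask) ...
w_local = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
# ... and one that has picked up dependence on out-of-locality features.
w_leaky = np.array([1.0, -2.0, 0.5, 0.8, -0.6, 0.3])


def local_model(inputs):
    return sigmoid(inputs @ w_local)


def leaky_model(inputs):
    return sigmoid(inputs @ w_leaky)


leak_local = locality_leakage(local_model, x, mask, np.random.default_rng(1))
leak_leaky = locality_leakage(leaky_model, x, mask, np.random.default_rng(1))
print(f"local predictor leakage: {leak_local:.4f}")
print(f"leaky predictor leakage: {leak_leaky:.4f}")
```

A predictor that has learnt the concept's locality should score near zero under this kind of probe, while a non-zero score indicates that features outside the locality influence the concept prediction, which is the failure mode the abstract attributes to CBMs.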