Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above $99\%$, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.
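To make the swap protocol concrete, the following is a minimal sketch of how a swap evaluation could be scored. It is an illustration under stated assumptions, not the paper's implementation: the function names (`accuracy`, `swap_report`), the toy XOR task, and the exact definitions of swap drop (native minus swapped accuracy) and relative retention (swapped divided by native accuracy) are all hypothetical choices made here for clarity.

```python
from itertools import product
from statistics import mean


def accuracy(detector, head, inputs, labels):
    """Fraction of inputs for which head(detector(x)) equals the label."""
    hits = sum(head(detector(x)) == y for x, y in zip(inputs, labels))
    return hits / len(labels)


def swap_report(detectors, heads, inputs, labels):
    """Compare native pairs (detector i, head i) with swapped pairs (i != j).

    Assumed definitions: swap_drop = native - swapped,
    retention = swapped / native.
    """
    n = len(detectors)
    acc = {
        (i, j): accuracy(detectors[i], heads[j], inputs, labels)
        for i, j in product(range(n), repeat=2)
    }
    native = mean(acc[i, i] for i in range(n))
    swapped = mean(v for (i, j), v in acc.items() if i != j)
    return {
        "native": native,
        "swapped": swapped,
        "swap_drop": native - swapped,
        "retention": swapped / native if native else float("nan"),
    }


# Toy task (hypothetical): x is a pair of bits, the label is their XOR,
# and the two bits themselves are the ground-truth concepts.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in inputs]


def faithful_detector(x):
    # Bottleneck carries the true concepts.
    return x


def faithful_head(c):
    return c[0] ^ c[1]


def leaky_detector(x):
    # Shortcut: the first concept slot stores the label itself.
    return (x[0] ^ x[1], 0)


def leaky_head(c):
    # Reads the label straight out of the first slot.
    return c[0]


report = swap_report(
    [faithful_detector, leaky_detector],
    [faithful_head, leaky_head],
    inputs,
    labels,
)
print(report)
```

In this toy setting both native pairs solve the task perfectly, yet pairing the faithful detector with the leaky head fails on half of the inputs. The gap between native and swapped accuracy is the kind of signal the swap experiments are designed to expose.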