Many interpretable AI approaches have been proposed to provide plausible explanations for a model's decision-making. However, configuring an explainable model whose computational modules communicate effectively has received less attention. The recently proposed shared global workspace theory showed that networks of distributed modules can benefit from sharing information through a bottlenecked memory, because the communication constraints encourage specialization, compositionality, and synchronization among the modules. Inspired by this, we propose Concept-Centric Transformers, a simple yet effective configuration of the shared global workspace for interpretability, consisting of: i) an object-centric memory module for extracting semantic concepts from input features, ii) a cross-attention mechanism between the learned concepts and the input embeddings, and iii) standard classification and explanation losses that allow human analysts to directly assess an explanation of the model's classification reasoning. We test our approach against existing concept-based methods on classification tasks across several datasets, including CIFAR100, CUB-200-2011, and ImageNet, and show that our model not only achieves better classification accuracy than all baselines on all problems but also generates more consistent concept-based explanations of its classification output.
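To make component ii) concrete, here is a minimal sketch of cross-attention between input embeddings and a small set of concept vectors. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which side supplies the queries, so this sketch assumes input embeddings act as queries and concepts as keys and values. It also omits learned projections and multiple heads. The names `cross_attention` and `concept_scores`, and the idea of averaging attention over input positions to get one score per concept, are hypothetical choices made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(inputs, concepts):
    """Scaled dot-product cross-attention (assumed layout, see lead-in).

    inputs:   N x d list of input embeddings (used as queries)
    concepts: K x d list of concept vectors (used as keys and values)
    Returns (outputs, attn): outputs is N x d, attn is N x K.
    """
    d = len(concepts[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for x in inputs:
        scores = [scale * sum(xi * ci for xi, ci in zip(x, c)) for c in concepts]
        attn.append(softmax(scores))  # each row sums to 1 over concepts
    outputs = [
        [sum(a[k] * concepts[k][j] for k in range(len(concepts))) for j in range(d)]
        for a in attn
    ]
    return outputs, attn


def concept_scores(attn):
    """Hypothetical explanation signal: mean attention per concept over inputs."""
    n = len(attn)
    return [sum(row[k] for row in attn) / n for k in range(len(attn[0]))]


# Toy data: three 2-d input embeddings and two concept vectors.
inputs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
concepts = [[4.0, 0.0], [0.0, 4.0]]
outputs, attn = cross_attention(inputs, concepts)
scores = concept_scores(attn)
```

In this toy setup the attention matrix `attn` is the part a human analyst could inspect: each row shows how strongly one input embedding attends to each concept. An explanation loss, as mentioned in iii), could in principle compare such scores against annotated concepts, though the exact form of that loss is not given in the abstract.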