Concept Bottleneck Models (CBMs) map inputs onto a set of interpretable concepts ("the bottleneck") and use these concepts to make predictions. A concept bottleneck enhances interpretability because it can be inspected to understand which concepts the model "sees" in an input and which of these concepts it deems important. However, CBMs are restrictive in practice, as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, which reduces the incentive to deploy them. In this work, we address these limitations by introducing Post-hoc Concept Bottleneck Models (PCBMs). We show that any neural network can be turned into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available for the training data, we show that a PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBMs is that they enable users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBMs allow for global model edits, which can be more efficient than prior work on local interventions that fix a single prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or retraining the model.
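The workflow described above (map backbone features to concepts, predict from the concepts, then edit the predictor globally) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the synthetic embeddings, the two hand-written concept vectors, the ridge-regression head without an intercept, and the helper names `project_onto_concepts`, `fit_linear_head`, and `prune_concept` are all hypothetical. It is not the authors' implementation.

```python
import numpy as np

def project_onto_concepts(embeddings, concept_vectors):
    """Map backbone embeddings (n, d) to concept scores (n, k) by projection."""
    unit = concept_vectors / np.linalg.norm(concept_vectors, axis=1, keepdims=True)
    return embeddings @ unit.T

def fit_linear_head(scores, labels, n_classes, ridge=1e-3):
    """Ridge-regress one-hot labels on concept scores (no intercept, for brevity)."""
    targets = np.eye(n_classes)[labels]
    gram = scores.T @ scores + ridge * np.eye(scores.shape[1])
    return np.linalg.solve(gram, scores.T @ targets)  # shape (k, n_classes)

def prune_concept(weights, concept_index):
    """Global edit: zero one concept's weights for every class, without retraining."""
    edited = weights.copy()
    edited[concept_index, :] = 0.0
    return edited

def accuracy(scores, weights, labels):
    return float(np.mean(np.argmax(scores @ weights, axis=1) == labels))

def make_embeddings(rng, labels, spurious_sign):
    """Synthetic stand-in for frozen backbone features (assumption).

    Dimension 0 carries the true signal; dimension 1 carries a cue whose
    relation to the label is set by spurious_sign (+1 agrees, -1 flips).
    """
    sign = 2.0 * labels - 1.0
    emb = rng.normal(scale=0.5, size=(labels.shape[0], 4))
    emb[:, 0] += sign
    emb[:, 1] += spurious_sign * sign
    return emb

rng = np.random.default_rng(0)
# Hypothetical concept directions: concept 0 is causal, concept 1 is spurious.
concept_vectors = np.array([[2.0, 0.0, 0.0, 0.0],
                            [0.0, 3.0, 0.0, 0.0]])

train_labels = rng.integers(0, 2, size=500)
test_labels = rng.integers(0, 2, size=500)
train_emb = make_embeddings(rng, train_labels, spurious_sign=+1.0)
test_emb = make_embeddings(rng, test_labels, spurious_sign=-1.0)  # shifted domain

train_scores = project_onto_concepts(train_emb, concept_vectors)
test_scores = project_onto_concepts(test_emb, concept_vectors)

weights = fit_linear_head(train_scores, train_labels, n_classes=2)
edited_weights = prune_concept(weights, concept_index=1)

acc_before = accuracy(test_scores, weights, test_labels)
acc_after = accuracy(test_scores, edited_weights, test_labels)
print(f"shifted-domain accuracy before edit: {acc_before:.3f}, after edit: {acc_after:.3f}")
```

In this toy setup the edit needs no target-domain data: the user only has to know that concept 1 is unreliable, and a single change to the weight matrix affects every prediction at once rather than fixing one input at a time.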