Recently, interpretable machine learning has re-explored concept bottleneck models (CBMs). An advantage of this model class is the user's ability to intervene on predicted concept values, thereby affecting the downstream output. In this work, we introduce a method to perform such concept-based interventions on pretrained neural networks, which are not interpretable by design, given only a small validation set with concept labels. Furthermore, we formalise the notion of intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black boxes. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We focus on backbone architectures of varying complexity, from simple, fully connected neural nets to Stable Diffusion. We demonstrate that the proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of our techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes are more intervenable than CBMs. Lastly, we establish that our methods remain effective under vision-language-model-based concept annotations, alleviating the need for a human-annotated validation set.
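As an illustration of the idea of intervening on a black box through its activations, the following is a minimal NumPy sketch and not the method as specified in the paper. It assumes, hypothetically, that concepts are read out from intermediate activations `z` by a logistic-regression probe fitted on the concept-labelled validation set, that an intervention edits `z` by gradient descent on a binary cross-entropy term towards the target concepts plus a squared Euclidean penalty for moving away from `z`, and that the downstream head is a binary logistic unit. The names `fit_linear_probe`, `intervene`, `intervention_gain`, the weight `lam`, and the head parameters `v` and `a` are all illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(p, t, eps=1e-12):
    """Mean binary cross-entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

def fit_linear_probe(Z, C, lr=0.5, steps=500):
    """Fit a probe q(z) = sigmoid(z @ W + b) from activations Z to concept labels C
    by plain gradient descent (assumed probe family: logistic regression)."""
    n, d = Z.shape
    W, b = np.zeros((d, C.shape[1])), np.zeros(C.shape[1])
    for _ in range(steps):
        G = (sigmoid(Z @ W + b) - C) / n   # gradient of the loss w.r.t. the logits
        W -= lr * (Z.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def intervene(z, c_target, W, b, lam=5.0, steps=200):
    """Edit activations so the probe's output moves towards c_target while staying
    close to z. Minimises lam * sum-BCE(q(z'), c_target) + ||z' - z||^2."""
    # Step size 1/L, where L bounds the curvature of the objective above.
    lr = 1.0 / (0.25 * lam * np.linalg.norm(W, 2) ** 2 + 2.0)
    z_new = z.copy()
    for _ in range(steps):
        p = sigmoid(z_new @ W + b)
        grad = lam * (p - c_target) @ W.T + 2.0 * (z_new - z)
        z_new -= lr * grad
    return z_new

def intervention_gain(z, z_new, y, v, a):
    """Target loss of a binary head g(z) = sigmoid(z @ v + a) before the
    intervention minus after it; positive values mean the intervention helped."""
    return bce(sigmoid(z @ v + a), y) - bce(sigmoid(z_new @ v + a), y)
```

In this sketch the concept values supplied by the user enter only through `c_target`, and `intervention_gain`, averaged over a dataset, is one simple way to quantify how effective such interventions are. The paper's formal definition of intervenability may differ in its choice of loss, distance, and probe.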