Recently, interpretable machine learning has re-explored concept bottleneck models (CBMs), which first predict high-level concepts from the raw features and then predict the target variable from the predicted concepts. A compelling advantage of this model class is the user's ability to intervene on the predicted concept values, thereby affecting the model's downstream output. In this work, we introduce a method to perform such concept-based interventions on already-trained neural networks, which are not interpretable by design, given an annotated validation set. Furthermore, we formalise the model's intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black-box models. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of the proposed techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes can be as intervenable as, and more performant than, CBMs.
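As an illustration of the concept-based interventions described above, the following is a minimal sketch of a toy CBM in plain Python. It is not the method proposed in this work for black-box networks; it only shows the standard CBM setting, in which a user overwrites selected predicted concepts and the target prediction is recomputed. All weights, function names (`predict_concepts`, `predict_target`, `intervene`) and input values are hypothetical and chosen for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy parameters (assumptions, not taken from the paper).
W_CONCEPT = [[2.0, -1.0], [-1.5, 1.0]]  # 2 raw features -> 2 concepts
B_CONCEPT = [0.0, 0.0]
W_TARGET = [3.0, -3.0]                  # 2 concepts -> 1 binary target
B_TARGET = 0.0


def predict_concepts(x):
    """First stage: predict concept probabilities from raw features."""
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_CONCEPT, B_CONCEPT)
    ]


def predict_target(c):
    """Second stage: predict the target probability from concepts only."""
    return sigmoid(sum(w * ci for w, ci in zip(W_TARGET, c)) + B_TARGET)


def intervene(c_pred, c_user, mask):
    """Replace predicted concepts with user-provided values where mask is True."""
    return [u if m else p for p, u, m in zip(c_pred, c_user, mask)]


x = [1.0, 1.0]                       # hypothetical input
c_hat = predict_concepts(x)          # predicted concepts
y_before = predict_target(c_hat)     # prediction without intervention

# The user corrects only the first concept, setting it to 0.
c_new = intervene(c_hat, [0.0, 1.0], [True, False])
y_after = predict_target(c_new)      # prediction after intervention

print("before:", y_before, "after:", y_after)
```

Because the target is computed from the concepts alone, editing a concept value propagates directly to the output. In a black-box network there is no such explicit concept layer, which is the gap the intervention method introduced in this work is meant to address.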