Deep learning models have achieved high performance in medical applications; however, their adoption in clinical practice is hindered by their black-box nature. Self-explainable models, such as prototype-based models, can be especially beneficial because they are interpretable by design. However, if the learnt prototypes are of low quality, a prototype-based model is no more interpretable than a black box. High-quality prototypes are therefore a prerequisite for a truly interpretable model. In this work, we propose a prototype evaluation framework for coherence (PEF-C) that quantitatively evaluates the quality of prototypes based on domain knowledge. We demonstrate PEF-C in the context of breast cancer prediction using mammography. Existing work on prototype-based models for breast cancer prediction using mammography has focused on improving classification performance relative to black-box models and has evaluated prototype quality through anecdotal evidence. We are the first to go beyond anecdotal evidence and evaluate the quality of mammography prototypes systematically using PEF-C. Specifically, we apply three state-of-the-art prototype-based models, ProtoPNet, BRAIxProtoPNet++ and PIP-Net, to mammography images for breast cancer prediction and evaluate these models with respect to i) classification performance and ii) quality of the prototypes, on three public datasets. Our results show that prototype-based models are competitive with black-box models in terms of classification performance and achieve a higher score in detecting regions of interest (ROIs). However, the quality of the prototypes is not yet sufficient and can be improved in terms of relevance, purity and the variety of prototypes learnt. We call on the XAI community to systematically evaluate the quality of prototypes to check their true usability in high-stakes decisions and to improve such models further.