Although Vision Transformers (ViTs) have recently demonstrated superior performance in medical imaging problems, they face explainability issues similar to those of earlier architectures such as convolutional neural networks. Recent research suggests that attention maps, which are part of the decision-making process of ViTs, can potentially address the explainability issue by identifying the regions that influence predictions, especially in models pretrained with self-supervised learning. In this work, we compare the visual explanations produced by attention maps with those of other commonly used explanation methods on medical imaging problems. To do so, we employ four distinct medical imaging datasets that involve the identification of (1) colonic polyps, (2) breast tumors, (3) esophageal inflammation, and (4) bone fractures and hardware implants. Through large-scale experiments on these datasets using a variety of supervised and self-supervised pretrained ViTs, we find that although attention maps show promise under certain conditions and generally surpass GradCAM in explainability, they are outperformed by transformer-specific interpretability methods. Our findings indicate that the efficacy of attention maps as an interpretability method is context-dependent and may be limited, as they do not consistently provide the comprehensive insights required for robust medical decision-making.
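Because the comparison above centers on attention maps as an explanation mechanism, a minimal sketch of how such maps are typically extracted from a ViT may be useful. This is an illustrative example only, assuming a timm-style ViT whose attention weights are recomputed from the block's qkv projection; the checkpoint name, the choice of the last block, and the CLS-token aggregation are assumptions for illustration, not the exact setup used in this work.

```python
import torch
import timm

# Load a ViT; the checkpoint and hooked layer here are illustrative choices.
model = timm.create_model("vit_small_patch16_224", pretrained=False)
model.eval()

attn_maps = []

def save_attention(module, inputs, output):
    """Recompute the attention weights of a timm ViT Attention block.

    timm's fused attention path does not retain the softmax weights,
    so we rebuild them from the block's own qkv projection.
    """
    x = inputs[0]
    B, N, C = x.shape
    qkv = (
        module.qkv(x)
        .reshape(B, N, 3, module.num_heads, C // module.num_heads)
        .permute(2, 0, 3, 1, 4)
    )
    q, k, _ = qkv.unbind(0)
    attn = (q @ k.transpose(-2, -1)) * module.scale
    attn_maps.append(attn.softmax(dim=-1))  # (B, heads, tokens, tokens)

# Hook the final block, where CLS attention is most commonly visualized.
model.blocks[-1].attn.register_forward_hook(save_attention)

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a medical image

# Attention from the CLS token to the 196 patch tokens, averaged over
# heads and reshaped to the 14x14 patch grid for overlay on the input.
cls_attn = attn_maps[0][0, :, 0, 1:].mean(dim=0)
heatmap = cls_attn.reshape(14, 14)
```

Upsampling `heatmap` to the input resolution and overlaying it on the image yields the kind of visual explanation that this work compares against GradCAM and transformer-specific interpretability methods.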