The Vision Transformer (ViT) demonstrates exceptional performance in various computer vision tasks. Attention is crucial for ViT to capture complex, long-range relationships among image patches: it lets the model weigh the importance of each patch and helps us understand its decision-making process. However, when ViT's attention is used as evidence in high-stakes decision-making tasks such as medical diagnosis, a challenge arises because the attention mechanism may erroneously focus on irrelevant regions. In this study, we propose a statistical test for ViT's attention, which allows attention to serve as a reliable quantitative indicator of the evidence behind ViT's decisions with a rigorously controlled error rate. Using the framework of selective inference, we quantify the statistical significance of attention in the form of p-values, which gives a theoretically grounded quantification of the false positive detection probability of attention. We demonstrate the validity and effectiveness of the proposed method through numerical experiments and applications to brain image diagnosis.