The Vision Transformer (ViT) demonstrates exceptional performance in a variety of computer vision tasks. Attention is crucial for ViT to capture complex, wide-ranging relationships among image patches: it allows the model to weigh the importance of each patch and aids our understanding of its decision-making process. However, when ViT attention is used as evidence in high-stakes decision-making tasks such as medical diagnostics, a challenge arises because the attention mechanism may erroneously focus on irrelevant regions. In this study, we propose a statistical test for ViT attention that allows attention to serve as a reliable quantitative indicator of evidence for ViT's decisions, with a rigorously controlled error rate. Using the framework of selective inference, we quantify the statistical significance of attention in the form of p-values, which gives a theoretically grounded measure of the probability that a detected attention region is a false positive. We demonstrate the validity and effectiveness of the proposed method through numerical experiments and applications to brain image diagnosis.