The Vision Transformer (ViT) is one of the most widely used models in computer vision, owing to its strong performance on a variety of tasks. To make full use of ViT-based architectures in applications, visualization methods with good localization performance are needed, but the methods developed for CNN-based models do not carry over to ViT because of its distinct structure. In this work, we propose an attention-guided visualization method for ViT that provides a high-level semantic explanation of its decisions. Our method selectively aggregates the gradients propagated directly from the classification output to each self-attention layer, collecting the contribution of the image features extracted at each location of the input image. These gradients are further guided by the normalized self-attention scores, which are pairwise patch correlation scores; the scores supplement the gradients with the patch-level context information that the self-attention mechanism detects efficiently. The method thus produces detailed high-level semantic explanations with strong localization performance using only class labels. As a result, it outperforms the previous leading explainability methods for ViT on the weakly supervised localization task and captures full instances of the target-class object well. A perturbation comparison test further shows that the visualization faithfully explains the model.
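The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes: class-score gradients at each self-attention layer are weighted by normalized attention scores and summed into a patch-level heatmap. The function name `attention_guided_map`, the array layout, the ReLU gating of gradients, and the sigmoid used as the normalization are all assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np


def attention_guided_map(attn, grad, grid):
    """Combine attention scores and class-score gradients into a heatmap.

    attn, grad: arrays of shape (layers, heads, tokens, tokens), where
    token 0 is the [CLS] token and the remaining tokens are image patches.
    grid: (rows, cols) layout of the patches in the input image.
    """
    attn = np.asarray(attn, dtype=float)
    grad = np.asarray(grad, dtype=float)
    rows, cols = grid
    if attn.shape != grad.shape or attn.shape[-1] != rows * cols + 1:
        raise ValueError("expected matching (L, H, N+1, N+1) arrays, N = rows*cols")

    # [CLS] row, patch columns only: how the class token relates to each patch.
    cls_attn = attn[:, :, 0, 1:]
    cls_grad = grad[:, :, 0, 1:]

    # Assumption: keep only gradients that increase the class score.
    pos_grad = np.maximum(cls_grad, 0.0)

    # Assumption: a sigmoid stands in for the unspecified normalization.
    guide = 1.0 / (1.0 + np.exp(-cls_attn))

    # Aggregate over layers and heads, then lay patches out on the image grid.
    heat = (pos_grad * guide).sum(axis=(0, 1)).reshape(rows, cols)

    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = heat.max()
    return heat / peak if peak > 0 else heat
```

In practice `attn` and `grad` would be collected from a trained ViT, for example with forward and backward hooks on each attention module; that plumbing is omitted here. The resulting grid can be upsampled to the input resolution to overlay on the image.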