While Transformers have rapidly gained popularity in computer vision applications, post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However, existing post-hoc explanation methods consider only these attention weights and neglect crucial information from the transformed tokens, so they fail to accurately reflect the rationales behind the models' predictions. To incorporate the influence of token transformation into interpretation, we propose TokenTM, a novel post-hoc explanation method built on a measurement of token transformation effects that we introduce. Specifically, we quantify token transformation effects by measuring changes in token lengths and correlations in token directions before and after transformation. Moreover, we develop initialization and aggregation rules that integrate both attention weights and token transformation effects across all layers, capturing holistic token contributions throughout the model. Experimental results on segmentation and perturbation tests demonstrate the superiority of TokenTM over state-of-the-art Vision Transformer explanation methods.
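To make the two measured quantities concrete, the following is a minimal sketch, not the TokenTM formulation itself. It assumes that "token length" means the L2 norm of a token vector and that "correlation in direction" can be illustrated with cosine similarity; the function names `length_ratio` and `direction_cosine` are hypothetical, and how the method combines these quantities with attention weights is not specified here.

```python
import math


def _l2_norm(vec):
    """Euclidean (L2) norm of a token vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def length_ratio(before, after):
    """Change in token length: norm after transformation / norm before.

    Assumption: 'length' is taken to be the L2 norm.
    """
    norm_before = _l2_norm(before)
    if norm_before == 0.0:
        raise ValueError("token before transformation has zero length")
    return _l2_norm(after) / norm_before


def direction_cosine(before, after):
    """Directional agreement of a token before and after transformation.

    Assumption: cosine similarity stands in for the 'correlation in
    direction' mentioned in the text.
    """
    norm_before = _l2_norm(before)
    norm_after = _l2_norm(after)
    if norm_before == 0.0 or norm_after == 0.0:
        raise ValueError("cosine is undefined for a zero-length token")
    dot = sum(b * a for b, a in zip(before, after))
    return dot / (norm_before * norm_after)


# Illustrative toy tokens (made up for this sketch, not from any model).
scaled_before, scaled_after = [3.0, 4.0], [6.0, 8.0]    # same direction, doubled
rotated_before, rotated_after = [1.0, 0.0], [0.0, 2.0]  # orthogonal, doubled

print(length_ratio(scaled_before, scaled_after))        # 2.0
print(direction_cosine(scaled_before, scaled_after))    # 1.0
print(length_ratio(rotated_before, rotated_after))      # 2.0
print(direction_cosine(rotated_before, rotated_after))  # 0.0
```

Both toy pairs double the token's length, but only the first preserves its direction, which is the kind of distinction that attention weights alone would not capture.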