Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. Although these weights are easy to interpret, they are not faithful to the models' decisions: attention is only one part of an encoder layer, and the other components in the layer can have a considerable impact on how information is mixed in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations based on linguistically informed rationales, probing, and faithfulness analysis.
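The abstract does not spell out how the score is computed, so the following is only a minimal sketch under stated assumptions: the value vector of one token j is set to zero inside an encoder block, the block output for every token i is recomputed, and the cosine distance between the original and altered output of token i is taken as the contribution of j to i, with each row then normalized to sum to one. The single-head `toy_block`, its random weights, and all function names are hypothetical stand-ins for a real encoder layer, not the authors' implementation.

```python
import math
import random

def matmul(a, b):
    # Plain-list matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - dot / (nu * nv))

def toy_block(x, w, zeroed=None):
    # Hypothetical single-head encoder block:
    # attention -> residual + LayerNorm -> feed-forward -> residual + LayerNorm.
    d = len(x[0])
    q, k, v = matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"])
    if zeroed is not None:
        # Assumption: only the value vector is zeroed; the attention
        # weights themselves are computed as usual.
        v[zeroed] = [0.0] * d
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, k_t)]
    ctx = matmul(matmul(attn, v), w["o"])
    h = [layer_norm([a + b for a, b in zip(xi, ci)]) for xi, ci in zip(x, ctx)]
    hidden = [[max(0.0, t) for t in row] for row in matmul(h, w["f1"])]
    ff = matmul(hidden, w["f2"])
    return [layer_norm([a + b for a, b in zip(hi, fi)]) for hi, fi in zip(h, ff)]

def value_zeroing(x, w):
    # scores[i][j]: how much token i's block output changes when token j's
    # value vector is zeroed, normalized so that each row sums to one.
    base = toy_block(x, w)
    n = len(x)
    raw = [[0.0] * n for _ in range(n)]
    for j in range(n):
        altered = toy_block(x, w, zeroed=j)
        for i in range(n):
            raw[i][j] = cosine_distance(base[i], altered[i])
    return [[c / (sum(row) or 1.0) for c in row] for row in raw]

# Toy usage: 4 tokens, hidden size 8, random inputs and weights.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model = 4, 8
x = rand_matrix(n_tokens, d_model)
w = {name: rand_matrix(d_model, d_model) for name in ("q", "k", "v", "o", "f1", "f2")}
scores = value_zeroing(x, w)
```

Because the distance is measured on the output of the whole block, the residual connections, layer normalization, and feed-forward sublayer all influence the score, which is the point of extending the analysis beyond raw attention weights. Applying the same idea to a real model would require hooking into that model's own attention implementation, which this sketch does not attempt.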