With the advent of large models based on the Transformer architecture, researchers have observed an anomalous phenomenon in the attention mechanism: extremely high attention on the first element, which is prevalent across Transformer-based models. Understanding this phenomenon is crucial for developing techniques that depend on the attention distribution, such as Key-Value (KV) cache compression and infinite extrapolation; however, its underlying cause remains unknown. In this paper, we analyze the phenomenon from the perspective of the waiver phenomenon, in which the internal values of certain elements in the sequence are reduced, allowing them to absorb excess attention without affecting their contribution to the information flow. We find that, depending on differences in positional encoding and attention patterns, models select waiver elements in one of two ways: based on positional encoding, or based on the feature distribution within elements.
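The core intuition behind the waiver phenomenon can be sketched numerically: because softmax attention weights must sum to one, excess attention has to land somewhere, and a token whose value vector is suppressed toward zero can absorb that mass without distorting the output. The following is a minimal illustrative sketch (not the paper's method; the scores and dimensions are arbitrary assumptions):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical example: 5 tokens with 4-dimensional value vectors.
rng = np.random.default_rng(0)
values = rng.normal(size=(5, 4))
values[0] = 0.0  # "waiver" token: its internal value is suppressed

# Assume the model assigns a very large attention score to token 0.
scores = np.array([6.0, 1.0, 1.2, 0.8, 1.1])
weights = softmax(scores)

# Token 0 receives most of the attention mass, yet the output is
# determined entirely by the remaining tokens, since its value is zero.
output = weights @ values
```

Here the first token acts purely as an attention sink: removing it from the weighted sum (after renormalization) would change the output direction very little, which is the sense in which it absorbs excess attention "without affecting its contribution to information."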