Pairwise dot-product self-attention is key to the success of transformers, which achieve state-of-the-art performance across a variety of applications in language and vision. Dot-product self-attention computes attention weights among the input tokens based on Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric to compute the attention weights, stretching the underlying feature space along directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of tokens lying in contextually important directions. We term this novel class of attention Elliptical Attention. Elliptical Attention provides two benefits: 1) it reduces representation collapse, and 2) it enhances the model's robustness, since it attends to contextually relevant information rather than focusing on a small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
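The core idea above can be illustrated with a minimal sketch, not the authors' implementation: attention scores are taken as the negative squared Mahalanobis distance between queries and keys under a diagonal metric. The per-dimension weight vector `m` is an assumption standing in for however the metric is learned or estimated; setting `m` to all ones recovers plain (negative squared Euclidean) distance attention, while larger entries stretch the feature space along the corresponding directions.

```python
import numpy as np

def elliptical_attention(Q, K, V, m):
    """Attention via negative squared Mahalanobis distance.

    Q, K: (n, d) queries and keys; V: (n, d_v) values;
    m: (d,) non-negative per-dimension metric weights (hypothetical
    stand-in for a learned/estimated metric).
    """
    # Squared Mahalanobis distance with diagonal metric diag(m):
    # d(q, k)^2 = sum_j m_j * (q_j - k_j)^2
    diff = Q[:, None, :] - K[None, :, :]           # (n, n, d)
    dist2 = np.einsum('ijk,k->ij', diff ** 2, m)   # (n, n)
    scores = -dist2 / np.sqrt(Q.shape[-1])         # scaled similarity
    # Row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = elliptical_attention(Q, K, V, np.ones(8))    # Euclidean special case
print(out.shape)  # (4, 8)
```

Tokens whose keys lie close to a query along the highly weighted directions receive larger attention weights, which is the hyper-ellipsoidal neighborhood described above.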