The attention mechanism is a fundamental component of the Transformer model, enabling interactions among distinct tokens, in contrast to earlier feed-forward neural networks. In general, attention scores are determined simply by key-query products. However, an incidental experiment in this work (combining DAPE and NoPE), which adds extra MLPs on top of the attention scores without position encoding, suggests that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply a convolution operator (over neighboring attention scores across different heads) to mimic processing methods from computer vision. Specifically, the main contribution of this paper is identifying and interpreting the Transformer length-extrapolation problem as a consequence of the limited expressiveness of the naive query-key dot product, and we successfully recast the length-extrapolation issue as a well-understood feature-map processing problem. This novel insight, which can be adapted to various attention-based models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
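To make the "attention as a feature map" idea concrete, the following is a minimal NumPy sketch, not the paper's exact method: the raw query-key score tensor of shape (heads, q_len, k_len) is treated as a multi-channel feature map, and each head's score map is refined with a shared 2D kernel over neighboring positions, added back as a residual. The helper name `conv_refine_scores` and the averaging kernel are illustrative assumptions; the paper's full design also mixes scores across heads and must respect causal masking, both omitted here for brevity.

```python
import numpy as np

def conv_refine_scores(scores, kernel):
    """Hypothetical helper: smooth each head's (q_len, k_len) score map
    with a shared 2D kernel over neighboring attention scores, then add
    the result back to the raw logits as a residual refinement."""
    h, q, k = scores.shape
    kh, kw = kernel.shape
    # Zero-pad the spatial dimensions so the output keeps its shape.
    padded = np.pad(scores, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(scores)
    for i in range(kh):
        for j in range(kw):
            # Accumulate each shifted slice weighted by the kernel entry.
            out += kernel[i, j] * padded[:, i:i + q, j:j + k]
    return scores + out  # residual: refined scores keep the raw signal

heads, q_len, k_len = 4, 8, 8
scores = np.random.randn(heads, q_len, k_len)
kernel = np.full((3, 3), 1.0 / 9.0)  # simple 3x3 averaging kernel (assumption)
refined = conv_refine_scores(scores, kernel)
print(refined.shape)  # (4, 8, 8)
```

After refinement, the scores would be masked and passed through softmax as usual; the convolution only reshapes the pre-softmax logit landscape.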