Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length limits their applicability to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $\mathcal{O}(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are progressively larger and the weights used to compute group quantities are learned. As a result, interactions between tokens that are far apart are computed at lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ when the queries are down-sampled and $\mathcal{O}(n \log n)$ when they are not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We evaluate on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer outperforms other efficient transformers in both memory footprint and accuracy. The Fast Multipole Attention mechanism has the potential to let large language models handle much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
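To make the $\mathcal{O}(\log n)$ grouping concrete, the following is a minimal sketch that only counts groups; it is not the authors' implementation. It assumes a simple dyadic (repeated halving) partition, and the function name `dyadic_groups` and the parameter `base` are hypothetical. The actual method learns the weights that summarize each group, and its exact grouping may differ.

```python
def dyadic_groups(i, n, base=4):
    """Partition key positions [0, n) into groups as seen from query i.

    Hypothetical helper (an assumption, not from the paper): the `base`
    tokens nearest to i stay at full resolution, and groups farther away
    are progressively larger.
    """
    lo, hi = 0, n
    groups = []
    while hi - lo > base:
        mid = (lo + hi) // 2
        if i < mid:
            groups.append((mid, hi))  # far half becomes one coarse group
            hi = mid
        else:
            groups.append((lo, mid))  # far half becomes one coarse group
            lo = mid
    # The remaining nearby tokens are kept individually.
    groups.extend((j, j + 1) for j in range(lo, hi))
    return groups


n = 1024
groups = dyadic_groups(i=300, n=n)
sizes = sorted(end - start for start, end in groups)
print(len(groups), "groups instead of", n, "keys")
print("group sizes:", sizes)
```

Under these assumptions each query interacts with roughly $\log_2(n/\text{base}) + \text{base}$ groups rather than $n$ keys, which is where an $\mathcal{O}(n \log n)$ total cost would come from when every query is kept at full resolution.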