Large Language Models (LLMs) have profoundly changed the world, and the self-attention mechanism is key to the success of the transformers that underlie them. However, the computational cost of attention is quadratic, $O(n^2)$, in the input sequence length $n$, which is a notorious obstacle to further improvement and to scaling to longer contexts. In this work, we leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention computation using convolution matrices. We propose a $\mathsf{conv}$ basis system, "similar" to the rank basis, and show that any lower triangular (attention) matrix can always be decomposed as a sum of $k$ structured convolution matrices in this basis system. We then design an algorithm that quickly decomposes the attention matrix into $k$ convolution matrices. Thanks to the Fast Fourier Transform (FFT), attention {\it inference} can be computed in $O(knd \log n)$ time, where $d$ is the hidden dimension. In practice, $d \ll n$, e.g., $d = 3{,}072$ and $n = 1{,}000{,}000$ for Gemma. Thus, when $kd = n^{o(1)}$, our algorithm achieves almost linear time, i.e., $n^{1+o(1)}$. Furthermore, the attention {\it training forward} pass and {\it backward gradient} can also be computed in $n^{1+o(1)}$ time. Our approach avoids explicitly computing the $n \times n$ attention matrix, which can substantially reduce the quadratic computational cost. Moreover, our algorithm works on any input matrices. This work provides a new paradigm for accelerating attention computation in transformers, enabling their application to longer contexts.
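The $O(knd \log n)$ bound rests on the fact that a lower triangular convolution (Toeplitz) matrix can be applied to an $n \times d$ matrix with FFTs instead of a dense product. The following is a minimal NumPy sketch of that step only, not the paper's algorithm: it does not perform the decomposition of an attention matrix, and it omits the softmax normalization. The function names (`conv_matrix`, `conv_apply_fft`, `sub_conv_matrix`, `conv_basis_apply`) are hypothetical, and the layout of each basis term (a convolution block of size $m \times m$ in the bottom-right corner, zeros elsewhere) is an assumption made for illustration.

```python
import numpy as np

def conv_matrix(a):
    """Dense n x n lower triangular Toeplitz matrix with H[i, j] = a[i - j] for j <= i."""
    n = len(a)
    H = np.zeros((n, n))
    for i in range(n):
        H[i, : i + 1] = a[i::-1]
    return H

def conv_apply_fft(a, V):
    """Compute conv_matrix(a) @ V in O(n d log n) time without forming the matrix."""
    n = len(a)
    m = 2 * n  # zero-pad so that circular convolution equals linear convolution
    fa = np.fft.rfft(a, n=m)
    fV = np.fft.rfft(V, n=m, axis=0)
    out = np.fft.irfft(fa[:, None] * fV, n=m, axis=0)
    return out[:n]  # the first n entries are the causal (lower triangular) part

def sub_conv_matrix(b, m, n):
    """ASSUMED basis term: conv_matrix(b[:m]) in the bottom-right m x m block, zeros elsewhere."""
    S = np.zeros((n, n))
    S[n - m:, n - m:] = conv_matrix(b[:m])
    return S

def conv_basis_apply(terms, V):
    """Apply a sum of k basis terms to V using one FFT-based product per term."""
    n = V.shape[0]
    out = np.zeros_like(V, dtype=float)
    for b, m in terms:
        # Each term only touches the trailing m rows of V and of the output.
        out[n - m:] += conv_apply_fft(b[:m], V[n - m:])
    return out

# Small check against the dense computation (illustrative sizes and random data).
rng = np.random.default_rng(0)
n, d = 64, 4
V = rng.standard_normal((n, d))
terms = [(rng.standard_normal(n), m) for m in (64, 40, 16)]  # k = 3 terms
dense = sum(sub_conv_matrix(b, m, n) for b, m in terms)
fast = conv_basis_apply(terms, V)
assert np.allclose(dense @ V, fast)
```

The dense path costs $O(n^2 d)$ and materializes an $n \times n$ matrix, while the FFT path performs $k$ products of cost $O(nd \log n)$ each and never forms that matrix, which is the source of the near-linear running time when $kd = n^{o(1)}$.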