Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Source: arXiv, Technical Report. Yiran Zhong is the corresponding author. The source code is available at https://github.com/OpenNLPLab/lightning-attention

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. Because it processes tokens with linear computational complexity, linear attention can, in theory, handle sequences of unlimited length without sacrificing speed, i.e., it maintains a constant training speed across sequence lengths at fixed memory consumption. However, because of issues with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate their theoretical advantage in a causal setting. In this paper, we present Lightning Attention-2, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits. To achieve this, we leverage the idea of tiling, handling the intra-block and inter-block components of the linear attention calculation separately. Specifically, we use the conventional attention computation mechanism for the intra-block components and apply linear attention kernel tricks to the inter-block components. Tiling is adopted in both the forward and backward passes to take full advantage of the GPU hardware. We implement our algorithm in Triton to make it IO-aware and hardware-friendly. We conduct experiments across different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.
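
To make the intra-block / inter-block split described above more concrete, here is a minimal NumPy sketch. It is not the authors' Triton kernel, and it rests on several simplifying assumptions: a single head, no decay factor, no normalization, and plain unnormalized causal linear attention, O = tril(Q K^T) V. The function names and the `block_size` parameter are hypothetical, chosen only for illustration. Within each block, the masked product is computed the conventional way; contributions from all earlier blocks are folded in through a running K^T V state, which is the kernel trick that avoids a token-by-token cumsum.

```python
import numpy as np


def causal_linear_attention_naive(q, k, v):
    """Quadratic-cost reference: apply a causal mask to Q K^T, then multiply by V."""
    scores = q @ k.T                      # (n, n) pairwise scores
    return np.tril(scores) @ v            # keep only positions j <= i


def causal_linear_attention_tiled(q, k, v, block_size=4):
    """Block-wise version (illustrative sketch, hypothetical names).

    intra-block: conventional masked attention inside the current block
    inter-block: Q_block @ (running sum of K^T V over all previous blocks)
    """
    n, d = q.shape
    e = v.shape[1]
    out = np.zeros((n, e))
    kv_state = np.zeros((d, e))           # accumulated K^T V from earlier blocks

    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]

        intra = np.tril(qb @ kb.T) @ vb   # causal part within the block
        inter = qb @ kv_state             # everything before the block
        out[start:end] = intra + inter

        kv_state += kb.T @ vb             # update state for the next block

    return out
```

Under these assumptions the two functions agree up to floating-point error, e.g. `np.allclose(causal_linear_attention_naive(q, k, v), causal_linear_attention_tiled(q, k, v))` for random `q`, `k`, `v`. The cost of the tiled version grows linearly with the number of blocks, since each block touches only a fixed-size `kv_state` rather than all earlier tokens.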
