As the demand for processing extended textual data grows, the ability to handle long-range dependencies while maintaining computational efficiency is more critical than ever. A key issue in long-sequence modeling with attention-based models is the mismatch between the limited-range modeling power of full attention and the long-range token dependencies in the input sequence. In this work, we propose to scale up the attention receptive field by tensorizing long input sequences into compact tensor representations and then applying attention along each transformed dimension. The resulting Tensorized Attention serves as an efficient Transformer backbone that extends the input context length with improved memory and time efficiency. We show that the proposed attention tensorization encodes token dependencies as a multi-hop attention process and is equivalent to a Kronecker decomposition of full attention. Extensive experiments show that tensorized attention can be used to adapt pretrained LLMs with improved efficiency. Notably, Llama-8B with tensorization is trained with a 32,768-token context length and steadily extrapolates to 128k length at inference with an $11\times$ speedup over full attention with FlashAttention-2.
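The core idea above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it simply folds a length-$L$ sequence into an $(L_1, L_2)$ grid with $L = L_1 \cdot L_2$ and applies self-attention along each folded axis in turn, so each axis-wise attention costs $O(L_i^2)$ per slice instead of $O(L^2)$ overall, and information propagates across the full sequence via the multi-hop composition of axis attentions. The `tensorized_attention` helper and its tied query/key/value projections are illustrative simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the second-to-last axis.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def tensorized_attention(x, shape):
    # x: (L, d) with L = prod(shape). Fold the sequence into a tensor
    # and attend along each folded axis in turn (a multi-hop process).
    L, d = x.shape
    assert int(np.prod(shape)) == L
    t = x.reshape(*shape, d)  # e.g. (L1, L2, d)
    for axis in range(len(shape)):
        t = np.moveaxis(t, axis, -2)   # bring this axis next to features
        t = attention(t, t, t)         # self-attention along this axis only
        t = np.moveaxis(t, -2, axis)   # restore the original layout
    return t.reshape(L, d)

# A 64-token sequence folded into an 8x8 grid: two axis attentions of
# size 8 replace one full attention of size 64.
out = tensorized_attention(np.random.randn(64, 8), (8, 8))
```

Because each axis-wise attention is a small dense attention, the composition over all axes corresponds to a Kronecker-structured approximation of the full $L \times L$ attention matrix, which is the equivalence the abstract refers to.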