Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, which restricts the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimization. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
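The abstract does not spell out how dilated attention is computed, so the following is only a minimal sketch of one possible reading: the sequence is split into fixed-length segments, every `dilation`-th position in each segment is kept, and ordinary softmax attention is applied among the kept positions. The names `dilated_attention`, `attended_pairs`, `segment_len`, and `dilation` are hypothetical, the sketch is non-causal and single-head, and it omits how several such patterns would be mixed to obtain the exponentially growing attentive field.

```python
import numpy as np


def dilated_attention(q, k, v, segment_len, dilation):
    """One sparse attention pattern (illustrative assumption, not the authors' code).

    q, k, v: float arrays of shape (n, d), with n divisible by segment_len.
    Positions skipped by the dilation pattern receive a zero output row.
    """
    n, d = q.shape
    assert n % segment_len == 0, "n must be a multiple of segment_len"
    out = np.zeros_like(v)
    for start in range(0, n, segment_len):
        # Keep every `dilation`-th position inside this segment.
        idx = np.arange(start, start + segment_len, dilation)
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        # Numerically stable softmax over the kept keys.
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out


def attended_pairs(n, segment_len, dilation):
    """Count the query-key pairs scored by one pattern over n tokens."""
    kept_per_segment = len(range(0, segment_len, dilation))
    return (n // segment_len) * kept_per_segment ** 2
```

For 16 tokens, full attention scores 256 query-key pairs, whereas `attended_pairs(16, 4, 1)` gives 64, `attended_pairs(16, 8, 2)` gives 32, and `attended_pairs(16, 16, 4)` gives 16. Under this assumed scheme, wider segments with proportionally larger dilation cost less while reaching further, which is consistent with the linear-complexity claim above.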