The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and it integrates seamlessly with Flash Attention. In addition to its strong extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating that it is a viable open-source alternative. All code and data used in this work are released at \url{https://github.com/HKUNLP/ChunkLlama}.
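To make the chunk-based decomposition concrete, the following is a minimal sketch of how relative positions could be assigned so that they never exceed the pretraining length. It is an illustration under stated assumptions, not the released implementation: the function name `relative_positions`, the parameters `chunk_size` and `pretrain_len`, and the choice of pinning the query index to `pretrain_len - 1` for inter-chunk pairs are all assumptions made here, and the sketch covers only the two cases named above (Intra-Chunk and Inter-Chunk).

```python
def relative_positions(seq_len, chunk_size, pretrain_len):
    """Return a causal matrix rel[i][j] of query-key relative positions.

    Illustrative assumptions (not taken from the released code):
      * keys are indexed by their offset inside their own chunk;
      * a query attending inside its own chunk uses its in-chunk offset;
      * a query attending to an earlier chunk uses the fixed index
        pretrain_len - 1, so the distance stays inside the trained range.
    Entries above the diagonal (future tokens) are None.
    """
    if not 0 < chunk_size <= pretrain_len:
        raise ValueError("chunk_size must be in (0, pretrain_len]")

    rel = [[None] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):          # query index
        for j in range(i + 1):        # key index (causal: j <= i)
            key_pos = j % chunk_size
            if i // chunk_size == j // chunk_size:
                # Intra-chunk: ordinary relative distance i - j.
                query_pos = i % chunk_size
            else:
                # Inter-chunk: query index capped at the largest trained one.
                query_pos = pretrain_len - 1
            rel[i][j] = query_pos - key_pos
    return rel


if __name__ == "__main__":
    # 12 tokens, chunks of 4, hypothetical pretraining length of 8.
    rel = relative_positions(seq_len=12, chunk_size=4, pretrain_len=8)
    largest = max(v for row in rel for v in row if v is not None)
    print(largest)  # 7, i.e. pretrain_len - 1
```

In this sketch every relative position lies in the range 0 to `pretrain_len - 1` regardless of `seq_len`, which is the property that lets a model attend over inputs longer than those seen in pretraining without further training.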