We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computational cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 requires 16x the computation in self-attention layers of training with a context length of 2048, because self-attention cost grows quadratically with sequence length. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization layers. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
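The shift short attention described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function names `grouped_attention` and `shifted_sparse_attention`, the tensor layout `(B, N, H, D)`, the choice to shift half of the heads by half a group, and the per-group causal mask are all choices made here for the example. The idea shown is that attention is computed only inside fixed-size groups of tokens, and that shifting the grouping for some heads lets information cross group boundaries.

```python
import numpy as np


def grouped_attention(q, k, v):
    """Causal softmax attention computed independently inside each group.

    q, k, v have shape (num_groups, G, H, D).
    """
    g, d = q.shape[1], q.shape[-1]
    # Attention scores per group and head: (num_groups, H, G, G).
    scores = np.einsum("nqhd,nkhd->nhqk", q, k) / np.sqrt(d)
    # Mask out keys that come after the query within the group.
    causal = np.tril(np.ones((g, g), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("nhqk,nkhd->nqhd", weights, v)


def shifted_sparse_attention(q, k, v, group_size):
    """Sketch of group-wise attention with a half-group shift on half the heads.

    q, k, v have shape (B, N, H, D). Assumes N is divisible by group_size,
    and that H and group_size are even (assumptions of this sketch).
    """
    b, n, h, d = q.shape
    assert n % group_size == 0 and h % 2 == 0 and group_size % 2 == 0
    half = group_size // 2

    def shift_and_group(x):
        x = x.copy()
        # Shift the second half of the heads by half a group along the
        # token axis. np.roll wraps around, so one shifted group mixes
        # tokens from the end and the start of the sequence.
        x[:, :, h // 2:] = np.roll(x[:, :, h // 2:], -half, axis=1)
        # Fold groups into the batch axis so attention stays local.
        return x.reshape(b * n // group_size, group_size, h, d)

    out = grouped_attention(
        shift_and_group(q), shift_and_group(k), shift_and_group(v)
    )
    out = out.reshape(b, n, h, d)
    # Undo the shift so outputs line up with the original token order.
    out[:, :, h // 2:] = np.roll(out[:, :, h // 2:], half, axis=1)
    return out
```

In this sketch each query scores at most `group_size` keys instead of all `N`, which is where the saving over dense attention comes from. With `N = 8` and `group_size = 4`, the unshifted heads see tokens in groups `[0..3]` and `[4..7]`, while the shifted heads see `[2..5]` and `[6, 7, 0, 1]`, so token 4 can draw on token 2 only through a shifted head.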