We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Training LLMs with long context sizes is typically computationally expensive, requiring extensive training hours and GPU resources: because self-attention cost scales quadratically with sequence length, training at a context length of 8192 incurs 16x the self-attention computation of training at 2048. In this paper, we speed up the context extension of LLMs in two ways. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S^2-Attn) effectively enables context extension, yielding non-trivial computation savings while achieving performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA works well for context extension provided that the embedding and normalization layers are also trainable. LongLoRA combines this improved LoRA with S^2-Attn and demonstrates strong empirical results on various tasks with Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from a 4k context to 100k, or Llama2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following dataset, LongAlpaca.
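The core idea of S^2-Attn can be illustrated with a minimal sketch: tokens are split into groups that attend only locally, and half of the attention heads operate on a sequence rolled by half a group so that their groups straddle the boundaries of the unshifted groups, letting information flow between neighbors. The sketch below is illustrative only, not the authors' implementation; the function name and the list-based token layout are assumptions.

```python
# Hedged sketch of the S^2-Attn shift-and-group step, assuming a plain
# Python list of per-token features; the real method applies the roll to
# half of the attention heads inside a standard attention layer.
def shift_and_group(tokens, group_size):
    """Cyclically roll the sequence by half a group, then split it into
    local groups; attention would then run independently within each group."""
    shift = group_size // 2
    # Roll by -shift: the shifted heads' groups now span the boundaries
    # of the unshifted heads' groups, connecting neighboring groups.
    shifted = tokens[shift:] + tokens[:shift]
    return [shifted[i:i + group_size]
            for i in range(0, len(shifted), group_size)]
```

For example, with 8 token positions and `group_size=4`, the shifted view groups positions `[2, 3, 4, 5]` and `[6, 7, 0, 1]`, whereas the unshifted heads see `[0, 1, 2, 3]` and `[4, 5, 6, 7]`; together the two views cover every group boundary.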