We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computational cost. Training LLMs with long context sizes is typically expensive, requiring extensive training hours and GPU resources. For example, because self-attention cost grows quadratically with sequence length, training at a context length of 8192 requires 16x the self-attention computation of training at 2048. In this paper, we speed up the context extension of LLMs in two ways. First, although dense global attention is needed during inference, the model can be fine-tuned effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S$^2$-Attn) enables context extension with non-trivial computation savings and performance similar to fine-tuning with vanilla attention. Notably, it can be implemented with only two lines of code in training and is optional at inference. Second, we revisit the parameter-efficient fine-tuning regime for context expansion. We find that LoRA works well for context extension provided that the embedding and normalization layers are also trainable. LongLoRA combines this improved LoRA with S$^2$-Attn. LongLoRA demonstrates strong empirical results on various tasks with Llama2 models from 7B and 13B to 70B. On a single 8x A100 machine, LongLoRA extends Llama2 7B from a 4k context to 100k, or Llama2 70B to 32k. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as Flash-Attention2. In addition, we conduct supervised fine-tuning with LongLoRA on our long instruction-following dataset, LongAlpaca.
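To make the cost argument and the grouping idea concrete, the following is a minimal sketch in plain Python. It assumes that S$^2$-Attn splits the sequence into fixed-size groups, restricts attention to tokens within the same group, and shifts the partition by half a group in some of the attention heads so that information can cross group boundaries. The function names, the group size, and the wrap-around handling of the last shifted group are illustrative assumptions, not the released implementation.

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Ratio of pairwise attention-score computations, which scale as n^2."""
    return (long_len / short_len) ** 2


def full_attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by dense global attention."""
    return seq_len * seq_len


def local_attention_pairs(seq_len: int, group_size: int) -> int:
    """Query-key pairs scored when attention is restricted to each group."""
    return (seq_len // group_size) * group_size * group_size


def group_ids(seq_len: int, group_size: int, shifted: bool) -> list:
    """Assign each token position to a local attention group.

    Unshifted heads use contiguous blocks of `group_size` tokens.
    Shifted heads (an assumption of this sketch) roll the sequence by
    half a group first, so their blocks straddle the boundaries of the
    unshifted blocks; the last block wraps around to the start.
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    offset = group_size // 2 if shifted else 0
    return [((pos - offset) % seq_len) // group_size for pos in range(seq_len)]


if __name__ == "__main__":
    # Going from 2048 to 8192 tokens multiplies dense attention cost by 4^2.
    print(attention_cost_ratio(8192, 2048))
    # Dense versus grouped attention at 8192 tokens (group size is hypothetical).
    print(full_attention_pairs(8192), local_attention_pairs(8192, 2048))
    # Toy example: 8 tokens, groups of 4.
    print(group_ids(8, 4, shifted=False))
    print(group_ids(8, 4, shifted=True))
```

In the toy example, the unshifted heads group tokens 0-3 and 4-7, while the shifted heads group tokens 2-5 together, so stacking both patterns lets information flow between neighboring groups. With a group size of 2048 at a context length of 8192, the grouped pattern scores one quarter of the query-key pairs that dense attention does.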