LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. Training LLMs with long context sizes is typically expensive, requiring extensive training hours and GPU resources. For example, training at a context length of 8192 costs 16x as much computation in the self-attention layers as training at 2048. In this paper, we speed up the context extension of LLMs in two ways. First, although dense global attention is needed during inference, the model can be fine-tuned effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S$^2$-Attn) enables context extension with non-trivial computation savings and performance similar to fine-tuning with vanilla attention. It can be implemented with only two lines of code in training and is optional at inference. Second, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA works well for context extension provided that the embedding and normalization layers are also trainable. LongLoRA combines this improved LoRA with S$^2$-Attn. LongLoRA demonstrates strong empirical results on various tasks with Llama2 models from 7B and 13B to 70B. On a single 8x A100 machine, LongLoRA extends Llama2 7B from a 4k context to 100k, or Llama2 70B to 32k. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as Flash-Attention2. In addition, we conduct supervised fine-tuning with LongLoRA and our long instruction-following dataset, LongAlpaca.
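
The cost figures and the grouping idea behind S$^2$-Attn can be illustrated with a small pure-Python sketch. It is not the authors' implementation. It assumes, following the description in the full paper, that the tokens are split into fixed-size groups and that half of the attention heads see the sequence rolled by half a group. The function names and the group size of 2048 are hypothetical choices made for illustration, and the count ignores causal masking and constant factors.

```python
def attention_pairs_full(n: int) -> int:
    """Query-key pairs scored by dense attention over n tokens (no causal mask)."""
    return n * n


def attention_pairs_grouped(n: int, group: int) -> int:
    """Pairs scored when n tokens are split into n // group independent groups."""
    assert n % group == 0, "sequence length must be a multiple of the group size"
    return (n // group) * group * group


def group_ids(n: int, group: int, shifted: bool) -> list[int]:
    """Group index of each token position.

    Assumption: for the shifted half of the heads, the sequence is rolled by
    half a group before grouping, so the boundaries fall in the middle of the
    unshifted groups. The last shifted group wraps around the sequence ends.
    """
    offset = group // 2 if shifted else 0
    return [((i - offset) % n) // group for i in range(n)]


if __name__ == "__main__":
    # Quadratic scaling: 4x longer context means 16x more pairs.
    print(attention_pairs_full(8192) // attention_pairs_full(2048))
    # Grouped attention at 8192 tokens with an illustrative group size of 2048.
    print(attention_pairs_full(8192) // attention_pairs_grouped(8192, 2048))
    # Toy sequence of 8 tokens, groups of 4: plain vs. shifted grouping.
    print(group_ids(8, 4, shifted=False))
    print(group_ids(8, 4, shifted=True))
```

In the toy case, the unshifted heads group tokens 0-3 and 4-7, while the shifted heads group tokens 2-5 together, so information can flow between neighbouring groups even though each head only attends locally.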


