Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two steps are typically required: a pretrained LLM first undergoes a separate context-length-extension stage trained on long-context data, followed by architectural modifications to reduce KV cache overhead during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (attending to initial tokens), and blockwise sparse attention (attending to strided token blocks), are well-suited for building efficient long-context models, primarily because their GPU-friendly memory access patterns deliver efficiency gains in practice, not just in theory. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient attention layers achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B tokens of long-context data is sufficient to extend the hybrid model's context length from 4K to 128K. We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves a 1.55x training speedup and reduces wall-clock time by 36% compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving a 1.67x prefilling speedup and a 1.41x decoding speedup.
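The three sparse patterns named above (attention sink, window attention, and blockwise sparse attention) can each be expressed as a structured boolean mask over key positions; combining them with a causal constraint gives a mask whose regular, contiguous layout is what makes the pattern GPU-friendly. The sketch below is illustrative only: the parameter names (`sink`, `window`, `block`, `stride`) and their default values are assumptions for demonstration, not the configuration used in LongGen.

```python
import numpy as np

def sparse_attention_mask(seq_len, sink=4, window=8, block=4, stride=16):
    """Boolean causal mask combining three GPU-friendly sparse patterns:
    an attention sink (the first `sink` tokens), a sliding window over the
    `window` most recent tokens, and strided blocks of `block` tokens taken
    every `stride` positions. Parameter values are illustrative defaults,
    not the paper's configuration."""
    q = np.arange(seq_len)[:, None]    # query positions (rows)
    k = np.arange(seq_len)[None, :]    # key positions (columns)
    causal = k <= q                    # never attend to future tokens
    sink_mask = k < sink               # attention sink: initial tokens
    window_mask = (q - k) < window     # window attention: recent tokens
    block_mask = (k % stride) < block  # blockwise sparse: strided blocks
    return causal & (sink_mask | window_mask | block_mask)

mask = sparse_attention_mask(32)
```

Because each pattern is defined by simple position arithmetic rather than data-dependent indices, the nonzero entries fall in contiguous bands and blocks, so attention kernels can skip whole tiles of the mask at once.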