Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two steps are typically required: a pretrained LLM first undergoes a separate context-length-extension stage trained on long-context data, followed by architectural modifications to reduce KV cache overhead during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (attending to initial tokens), and blockwise sparse attention (attending to strided token blocks), are well-suited for building efficient long-context models, primarily because their GPU-friendly memory access patterns deliver efficiency gains in practice, not just in theory. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient attention layers achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B tokens of long-context data is sufficient to extend the hybrid model's context length from 4K to 128K. We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves a 1.55x training speedup and reduces wall-clock time by 36% compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving a 1.67x prefilling speedup and a 1.41x decoding speedup.
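The three sparse patterns named above (attention sink, window attention, and blockwise sparse attention) can each be expressed as a structured boolean mask over key positions; combining them with a causal constraint gives a mask whose regular, contiguous layout is what makes the pattern GPU-friendly. The sketch below is illustrative only: the parameter names (`sink`, `window`, `block`, `stride`) and their default values are assumptions for demonstration, not the configuration used in LongGen.

```python
import numpy as np

def sparse_attention_mask(seq_len, sink=4, window=8, block=4, stride=16):
    """Boolean causal mask combining three GPU-friendly sparse patterns:
    an attention sink (the first `sink` tokens), a sliding window over the
    `window` most recent tokens, and strided blocks of `block` tokens taken
    every `stride` positions. Parameter values are illustrative defaults,
    not the paper's configuration."""
    q = np.arange(seq_len)[:, None]    # query positions (rows)
    k = np.arange(seq_len)[None, :]    # key positions (columns)
    causal = k <= q                    # never attend to future tokens
    sink_mask = k < sink               # attention sink: initial tokens
    window_mask = (q - k) < window     # window attention: recent tokens
    block_mask = (k % stride) < block  # blockwise sparse: strided blocks
    return causal & (sink_mask | window_mask | block_mask)

mask = sparse_attention_mask(32)
```

Because each pattern is defined by simple position arithmetic rather than data-dependent indices, the nonzero entries fall in contiguous bands and blocks, so attention kernels can skip whole tiles of the mask at once.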