This paper introduces a novel approach to improving how Large Language Models (LLMs) process and understand long text sequences, a critical capability in applications that require deep comprehension and synthesis of large volumes of information. Recognizing the inherent challenges of extending the context window of LLMs, which are primarily built on the Transformer architecture, we propose a new model architecture called Zebra. Zebra addresses the quadratic time and memory complexity of full attention in the Transformer by employing grouped local-global attention layers. Like a zebra's alternating stripes, the model balances local and global attention layers, significantly reducing computational requirements and memory consumption. We conduct comprehensive experiments, including pretraining from scratch, continued long-context adaptation training, and long instruction tuning, to evaluate Zebra's performance. The results show that Zebra achieves comparable or superior performance on both short- and long-sequence benchmarks while also improving training and inference efficiency.
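To make the grouped local-global idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the group size, the local window width, or which layer in a group is global, so `group_size`, `window`, and the "first layer of each group is global" ordering are illustrative assumptions, as are all function names. The sketch counts the query-key pairs each layer must score, which is the quantity that grows quadratically under full attention.

```python
def layer_schedule(num_layers, group_size):
    """Label each layer 'global' or 'local'.

    Assumption: the first layer of every group is global and the rest are
    local. The abstract does not specify the actual ordering.
    """
    return ["global" if i % group_size == 0 else "local" for i in range(num_layers)]


def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query position i may attend to key position j.
    With window=None this is ordinary causal (global) attention; with an
    integer window, each query sees only itself and the previous
    window - 1 positions (local attention).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len, window=None):
    """Count the query-key pairs that one attention layer must score."""
    return sum(sum(row) for row in causal_mask(seq_len, window))


def model_cost(num_layers, group_size, seq_len, window):
    """Sum attended pairs over all layers of the hypothetical schedule."""
    total = 0
    for kind in layer_schedule(num_layers, group_size):
        total += attended_pairs(seq_len, None if kind == "global" else window)
    return total
```

Under these assumptions, with `seq_len=4` and `window=2` a global layer scores 10 pairs and a local layer scores 7. The global count grows quadratically with sequence length, while the local count grows linearly for a fixed window, so the gap widens as sequences get longer. Setting `group_size=1` in `model_cost` makes every layer global and recovers the full-attention baseline for comparison.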