Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods are often empirical and lack a deep understanding of the internal distributions within RoPE, resulting in suboptimal performance when extending the context window. In this paper, we approach the context-window extension task from the perspective of rotary angle distributions. Specifically, we first estimate the distribution of the rotary angles within the model and analyze how much length extension perturbs this distribution. We then present a novel extension strategy that minimizes the disturbance between rotary angle distributions, maintaining consistency with the pre-training phase and enhancing the model's ability to generalize to longer sequences. Compared with strong baselines, our approach reduces the distributional disturbance by up to 72% when extending LLaMA2's context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method preserves the model's performance on the Hugging Face Open LLM benchmark after context window extension, with average performance fluctuations of only -0.12 to +0.22.
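To make the notion of rotary angle distributions concrete, the following is an illustrative sketch (not the paper's proposed method): in RoPE, each pair of hidden dimensions i is assigned a frequency base^(-2i/d), and a token at position p receives rotary angle p · base^(-2i/d). Naive extrapolation beyond the trained length pushes angles outside the range seen in pre-training, while a simple scaling baseline such as position interpolation maps extended positions back into the trained angle range. The function name `rope_angles` and the 4k/8k lengths are assumptions for illustration.

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotary angles theta[p, i] = (p / scale) * base**(-2i/dim)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)     # shape (dim/2,)
    return np.outer(positions / scale, freqs)         # shape (len(positions), dim/2)

# Angles covered during pre-training (e.g. a 4k context window)
train = rope_angles(np.arange(4096))

# Naive extrapolation to 8k: angles roughly double the trained range
extrap = rope_angles(np.arange(8192))

# Position interpolation (scale=2) keeps 8k positions inside the trained range
interp = rope_angles(np.arange(8192), scale=2.0)

print(extrap.max() / train.max())   # close to 2: extrapolated angles exceed the trained range
print(interp.max() / train.max())   # close to 1: interpolated angles stay in range
```

Scaling methods differ in how they trade off this kind of out-of-range extrapolation against compressing the angle distribution, which is the disturbance the abstract's strategy seeks to minimize.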