We present Position Interpolation (PI), a method that extends the context window of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on tasks that require long context, including passkey retrieval, language modeling, and long document summarization, on LLaMA models from 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
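The linear down-scaling of position indices described above can be illustrated with a minimal NumPy sketch. It assumes the standard RoPE formulation (frequencies $\theta_j = 10000^{-2j/d}$, rotation applied to consecutive dimension pairs) and a fixed scale factor equal to the trained length divided by the extended length. The function names `interpolate_positions`, `rope_angles`, and `apply_rope` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import numpy as np


def interpolate_positions(positions, trained_len, extended_len):
    """Linearly down-scale position indices into the trained range.

    Assumption: every index m is mapped to m * trained_len / extended_len,
    so indices in [0, extended_len) land in [0, trained_len).
    """
    scale = trained_len / extended_len
    return np.asarray(positions, dtype=np.float64) * scale


def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles m * theta_j with theta_j = base^(-2j/d), j = 0..d/2-1."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)


def apply_rope(x, angles):
    """Rotate each pair (x[:, 2j], x[:, 2j+1]) by the matching angle."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


# Example: a model trained on 2048 positions, extended to 8192 positions.
raw = np.arange(8192)
scaled = interpolate_positions(raw, trained_len=2048, extended_len=8192)
# The largest scaled index is 8191 * 0.25 = 2047.75, inside the trained range,
# whereas direct extrapolation would feed indices up to 8191 to the model.
angles = rope_angles(scaled, head_dim=128)
```

Because only the position indices passed to the rotary embedding change, the rest of the attention computation is untouched in this sketch, which is consistent with the claim that extended models keep their original architecture.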