We present Position Interpolation (PI), which extends the context window of RoPE-based pretrained LLMs such as LLaMA to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on tasks that require long context, including passkey retrieval, language modeling, and long document summarization, for LLaMA models from 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
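The linear down-scaling of position indices described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`rope_angles`, `interpolate_positions`, `apply_rope`), the RoPE base of 10000, the interleaved pairing of dimensions, and the example window sizes (2048 trained, 8192 extended) are all assumptions chosen for illustration.

```python
import numpy as np


def rope_angles(positions, head_dim, base=10000.0):
    """Rotation angles m * theta_j for each position m and frequency index j.

    `base` = 10000 is the conventional RoPE choice, assumed here.
    """
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)


def interpolate_positions(seq_len, trained_len):
    """Linearly down-scale indices 0..seq_len-1 into the trained range.

    Sequences no longer than the trained window are left unchanged.
    """
    positions = np.arange(seq_len, dtype=np.float64)
    scale = min(1.0, trained_len / seq_len)
    return positions * scale


def apply_rope(x, angles):
    """Rotate each pair (x[..., 2j], x[..., 2j+1]) by the matching angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


# Hypothetical sizes: a model trained on 2048 positions, run on 8192.
trained_len, extended_len, head_dim = 2048, 8192, 64
positions = interpolate_positions(extended_len, trained_len)
queries = np.random.default_rng(0).normal(size=(extended_len, head_dim))
rotated = apply_rope(queries, rope_angles(positions, head_dim))
```

With these sizes every scaled index is a multiple of 0.25 and stays below 2048, so the rotation angles never leave the range seen during pretraining. Extrapolation would instead feed the raw indices up to 8191 into `rope_angles`.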