Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state of the art in context window extension. In addition, we demonstrate that YaRN can extrapolate beyond the limited context of a fine-tuning dataset. We publish the checkpoints of Llama 2 7B/13B fine-tuned using YaRN with 64k and 128k context windows at https://github.com/jquesnelle/yarn
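As background for the claim that RoPE encodes positional information, the following is a minimal sketch of rotary embeddings in their commonly used form. It is not taken from the YaRN code and does not implement YaRN itself. The function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions chosen for illustration. The sketch rotates each consecutive pair of features by an angle proportional to the token position, so the attention score between a query and a key depends only on their relative offset.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by pos * theta_i.

    Assumed standard RoPE form: theta_i = base ** (-2i / d).
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)  # rotation frequency of pair i
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]  # 2D rotation of (a, b)
    return out

def dot(u, v):
    """Plain inner product, standing in for an attention score."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (3) at different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The two scores agree up to floating-point error because only the difference between positions enters the inner product. Positions beyond the trained length produce rotation angles the model has not seen during training, which is the generalization failure that context extension methods address.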