Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state of the art in context window extension. In addition, we demonstrate that YaRN can extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online at context lengths up to 128k at https://github.com/jquesnelle/yarn