Transformer-based Large Language Models (LLMs) are driving advances in many natural language processing tasks; however, their exceptional capabilities are confined to the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either show notable limitations in their extrapolation ability or sacrifice some performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics of the length scaling factor with ordinary differential equations, thereby overcoming the constraint that current PE scaling methods are designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX enables length extrapolation with strong performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results show that CLEX can effectively extend the context window to over 4x or almost 8x the training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
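To make the idea of "continuous dynamics over the length scaling factor" concrete, the following is a minimal sketch and not the CLEX implementation. It assumes the standard Rotary Position Embedding frequencies, theta_i = base^(-2i/d), and treats their logarithm as the state of an ordinary differential equation in the scaling factor t. The vector field `pi_field` is a hand-written, hypothetical choice used only for illustration: integrating d(log theta)/dt = -1/t from t = 1 yields theta(t) = theta(1)/t, which is the uniform frequency scaling used by linear position interpolation. All function names are assumptions introduced here; in the method described above the dynamics are modelled rather than fixed by hand.

```python
import numpy as np


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies: theta_i = base^(-2i/dim), one per rotated pair."""
    return base ** (-np.arange(0, dim, 2) / dim)


def pi_field(t, log_theta):
    """Hypothetical vector field (illustration only): d(log theta)/dt = -1/t.

    Its exact solution is theta(t) = theta(1) / t, i.e. uniform scaling.
    """
    return -np.ones_like(log_theta) / t


def integrate_frequencies(theta0, t_target, field=pi_field, steps=1000):
    """Integrate log-frequencies from scaling factor 1 to t_target (midpoint method)."""
    log_theta = np.log(theta0)
    ts = np.linspace(1.0, t_target, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        k1 = field(t0, log_theta)
        k2 = field(t0 + h / 2, log_theta + h / 2 * k1)
        log_theta = log_theta + h * k2
    return np.exp(log_theta)


# Usage: frequencies for a head dimension of 8, evolved to a 4x scaling factor.
theta = rope_frequencies(8)
theta_4x = integrate_frequencies(theta, 4.0)
```

Because the scaling factor is a continuous variable, the same integration can be stopped at any intermediate value of t, or continued past the value used in training, instead of committing to one preset target length. Swapping `pi_field` for a different function changes how each frequency evolves with t.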