Large Language Models (LLMs) are trained with a pre-defined context length, which restricts their use in scenarios that require long inputs. Previous efforts to adapt LLMs to longer inputs usually require fine-tuning at the target length (Full-length fine-tuning), which incurs a heavy training cost. To decouple the training length from the target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training, which simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. The bias terms and the length of each chunk are altered for every training example, allowing the model to adapt to all positions within the target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage at inference time. With ongoing progress in efficient inference, we believe PoSE can further scale the context window beyond 128k.
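The chunk-and-skip idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pose_position_ids`, its parameters, and the uniform sampling of chunk lengths and skipping bias terms are all assumptions made for illustration, including the choice to let the first chunk receive a bias as well. The sketch only shows how a fixed training window might be mapped onto position indices spread across a longer target length.

```python
import random


def pose_position_ids(train_len, target_len, num_chunks=2, rng=None):
    """Hypothetical helper: build PoSE-style position indices.

    Splits a training window of `train_len` tokens into `num_chunks`
    contiguous chunks and shifts each chunk by a skipping bias, so the
    resulting indices fall inside [0, target_len). The sampling scheme
    is an assumption, not taken from the paper.
    """
    if not (1 <= num_chunks <= train_len <= target_len):
        raise ValueError("need 1 <= num_chunks <= train_len <= target_len")
    rng = rng or random.Random()

    # Random chunk boundaries inside the training window (re-drawn per example).
    cuts = sorted(rng.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + cuts + [train_len]
    lengths = [end - start for start, end in zip(bounds, bounds[1:])]

    # Cumulative skipping bias per chunk. Sorting keeps the biases
    # non-decreasing, so positions stay strictly increasing overall.
    budget = target_len - train_len
    skips = sorted(rng.randint(0, budget) for _ in range(num_chunks))

    position_ids = []
    start = 0
    for length, skip in zip(lengths, skips):
        # Positions are consecutive inside a chunk and jump between chunks.
        position_ids.extend(range(start + skip, start + skip + length))
        start += length
    return position_ids


if __name__ == "__main__":
    ids = pose_position_ids(train_len=8, target_len=32, rng=random.Random(0))
    print(ids)
```

Under these assumptions, each training example would receive a freshly sampled list of indices in place of the default `0 .. train_len - 1`, so that over many examples the model sees relative positions spanning the full target length while attending over only `train_len` tokens.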