We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than those seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize \textit{domain balance} and \textit{length upsampling}. Concretely, we find that naively upsampling longer data in certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.