Data Engineering for Scaling Language Models to 128K Context

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at arbitrary input locations}, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training~(e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the \textit{quantity} and \textit{quality} of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize \textit{domain balance} and \textit{length upsampling}. Concretely, we find that naively upsampling longer data from certain domains such as books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
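
The abstract names two data-quality requirements, domain balance and length upsampling, without giving an implementation. The following is a minimal sketch of one way the two could be combined, not the authors' code: long documents are upweighted inside each domain separately, so that no domain gains or loses token mass overall. The function names, the `long_threshold` cutoff, and the `long_token_share` target are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def per_domain_length_weights(docs, long_threshold, long_token_share):
    """Return one sampling weight per document.

    docs: list of (domain, n_tokens) pairs.
    long_threshold: documents with at least this many tokens count as "long"
        (hypothetical cutoff, not taken from the paper).
    long_token_share: target fraction of each domain's tokens that should
        come from long documents (hypothetical knob, not taken from the paper).

    A weight is a multiplier on a document's token mass: after reweighting,
    document i contributes weight_i * n_tokens_i expected tokens.
    """
    long_tokens = defaultdict(int)
    short_tokens = defaultdict(int)
    for domain, n_tokens in docs:
        if n_tokens >= long_threshold:
            long_tokens[domain] += n_tokens
        else:
            short_tokens[domain] += n_tokens

    weights = []
    for domain, n_tokens in docs:
        total = long_tokens[domain] + short_tokens[domain]
        # A domain with only long or only short documents cannot be
        # rebalanced internally, so it is left untouched.
        if long_tokens[domain] == 0 or short_tokens[domain] == 0:
            weights.append(1.0)
        elif n_tokens >= long_threshold:
            weights.append(long_token_share * total / long_tokens[domain])
        else:
            weights.append((1.0 - long_token_share) * total / short_tokens[domain])
    return weights


def domain_token_mass(docs, weights):
    """Sum the weighted token mass of each domain."""
    mass = defaultdict(float)
    for (domain, n_tokens), weight in zip(docs, weights):
        mass[domain] += weight * n_tokens
    return dict(mass)


if __name__ == "__main__":
    # Toy corpus: (domain, document length in tokens). Values are made up.
    corpus = [
        ("books", 120_000), ("books", 60_000), ("books", 300_000),
        ("web", 40_000), ("web", 2_000), ("web", 2_000), ("web", 156_000),
        ("code", 3_000), ("code", 5_000),
    ]
    w = per_domain_length_weights(corpus, long_threshold=32_000, long_token_share=0.5)
    print(domain_token_mass(corpus, [1.0] * len(corpus)))
    print(domain_token_mass(corpus, w))
```

By construction, each domain's weighted token mass equals its original mass, so the domain mixture is unchanged while long documents gain share within every domain that has both long and short documents. Upsampling long documents globally would instead shift mass toward whichever domains happen to contain the longest documents, which is the imbalance the abstract warns against.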

