Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, and that both requirements grow with the size of the model being trained. This is infeasible because of the large compute cost and duration of pre-training, and because of the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training ($\textbf{WRAP}$), which uses an off-the-shelf instruction-tuned model prompted to paraphrase web documents in specific styles, such as "like Wikipedia" or in "question-answer format", and then jointly pre-trains LLMs on real data and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by $\sim3\times$. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on model performance, offering insights into how the composition of the training data can affect the performance of LLMs in out-of-distribution (OOD) settings. We attribute these gains to the fact that rephrased synthetic data has higher utility than real data alone because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
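To make the data pipeline described above concrete, the following is a minimal sketch of style-conditioned rephrasing and real/synthetic mixing. It is an illustration under stated assumptions, not the implementation used in this work: the prompt wording in `STYLE_PROMPTS`, the helper names `build_prompt` and `mix_real_and_synthetic`, the `rephrase` callable (a stand-in for any instruction-tuned model), and the default `synthetic_fraction` of 0.5 are all hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical style instructions. The wording is illustrative only and is
# not taken from the paper's actual prompts.
STYLE_PROMPTS: Dict[str, str] = {
    "wikipedia": "Paraphrase the following text in high-quality English, like Wikipedia:",
    "qa": "Convert the following text into question-answer format:",
}


def build_prompt(document: str, style: str) -> str:
    """Prepend the chosen style instruction to a raw web document."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"


def mix_real_and_synthetic(
    documents: Iterable[str],
    rephrase: Callable[[str], str],
    style: str,
    synthetic_fraction: float = 0.5,  # assumed ratio, not stated in the text
    seed: int = 0,
) -> Iterator[str]:
    """Yield a training stream mixing real documents with synthetic rephrases.

    `rephrase` is an assumed callable that sends a prompt to an
    instruction-tuned model and returns the generated text.
    """
    rng = random.Random(seed)
    for doc in documents:
        if rng.random() < synthetic_fraction:
            # Replace this document with its rephrase in the requested style.
            yield rephrase(build_prompt(doc, style))
        else:
            # Keep the original, noisy web text.
            yield doc
```

In practice `rephrase` would wrap a call to a served instruction-tuned model. Passing it in as a callable keeps the sketch independent of any particular inference library, and sampling per document is only one of several plausible ways to combine the two data sources.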