Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on ever-larger portions of the internet leads to consistent performance improvements, the size of these improvements diminishes with scale, and there has been little work exploring the effect of data selection on pre-training and downstream performance beyond simple de-duplication methods such as MinHash. Here, we show that careful data selection (on top of de-duplicated data) via pre-trained model embeddings can speed up training (20% efficiency gains) and improve average downstream accuracy on 16 NLP tasks (by up to 2%) at the 6.7B model scale. Furthermore, we show that repeating data intelligently consistently outperforms baseline training (while repeating random data performs worse than baseline training). Our results indicate that clever data selection can significantly improve LLM pre-training, call into question the common practice of training for a single epoch on as much data as possible, and demonstrate a path to keep improving our models past the limits of randomly sampling web data.