The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion-token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.