The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available, and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion-token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.