The impressive capabilities of recent language models can be largely attributed to the multi-trillion-token pretraining datasets on which they are trained. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high-quality pretraining sets.