Large Language Models (LLMs) achieve impressive performance through pretraining on corpora drawn from varied sources. However, the impact of each component of the pretraining corpus remains opaque. As a result, pretraining corpora are still organized empirically, and their composition may deviate from the optimum. To address this issue, we systematically analyze the impact of 48 datasets from five major categories of LLM pretraining data, measuring their effects with benchmarks that cover nine major categories of model capabilities. Our analyses provide empirical results on how multiple corpora contribute to LLM performance, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of "high-impact data", such as Books, that is significantly related to a set of model capabilities. These findings provide insights into how to organize data to support more efficient pretraining of LLMs.