Measuring similarity between training examples is critical for curating high-quality, diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that was trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well suited to pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining. Our framework's first evaluation criterion captures how well embedding distances reflect the generalization of pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate each embedding model's ability to distinguish examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well suited to pretraining data curation, underperforming even remarkably simple embeddings extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, pretraining a 1.7B-parameter language model on 200B tokens. We believe our analysis and evaluation framework lays a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
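To make the diversity-based curation step concrete, below is a minimal sketch of one standard diversity-driven selection heuristic, greedy k-center selection over document embeddings. The abstract does not specify which algorithm is used, so this particular choice, the function `greedy_k_center`, and the placeholder embedding shapes are all illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (assumed, not the paper's method): greedy k-center
# selection, a standard diversity-based curation heuristic that picks a
# subset of examples whose embeddings cover the dataset well.
import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Select k diverse example indices: repeatedly add the point farthest
    (in Euclidean distance) from the current set of selected centers."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())          # farthest point from the selection
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)   # update nearest-center distances
    return selected

# Usage: embed documents with any embedding model under evaluation,
# then select a diverse subset for pretraining.
docs_emb = np.random.randn(10_000, 384).astype(np.float32)  # placeholder embeddings
subset = greedy_k_center(docs_emb, k=1_000)
```

Under this kind of procedure, the choice of embedding model directly determines which examples look redundant and which look diverse, which is why the framework evaluates each embedding by the downstream performance of a model pretrained on the data it selects.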