Measuring similarity between training examples is critical for curating high-quality, diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that was trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well suited to pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining. Our framework's first evaluation criterion captures how well embedding distances reflect the generalization of pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate each embedding model's ability to distinguish examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well suited to pretraining data curation, underperforming even remarkably simple embeddings extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, pretraining a 1.7B-parameter language model on 200B tokens. We believe our analysis and evaluation framework lays a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
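To make the diversity-based curation step concrete, below is a minimal sketch of one standard diversity-driven selection heuristic, greedy k-center selection over document embeddings. The abstract does not specify which algorithm is used, so this particular choice, the function `greedy_k_center`, and the placeholder embedding shapes are all illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (assumed, not the paper's method): greedy k-center
# selection, a standard diversity-based curation heuristic that picks a
# subset of examples whose embeddings cover the dataset well.
import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Select k diverse example indices: repeatedly add the point farthest
    (in Euclidean distance) from the current set of selected centers."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())          # farthest point from the selection
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)   # update nearest-center distances
    return selected

# Usage: embed documents with any embedding model under evaluation,
# then select a diverse subset for pretraining.
docs_emb = np.random.randn(10_000, 384).astype(np.float32)  # placeholder embeddings
subset = greedy_k_center(docs_emb, k=1_000)
```

Under this kind of procedure, the choice of embedding model directly determines which examples look redundant and which look diverse, which is why the framework evaluates each embedding by the downstream performance of a model pretrained on the data it selects.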