Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Because raw text data is both large-scale and high-dimensional, existing methods either use simple heuristics to select data similar to a high-quality reference corpus (e.g., Wikipedia) or rely on experts to curate data manually. Instead, we extend the classic importance resampling approach, typically used in low dimensions, to LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, correlates strongly with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. Based on this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (n-gram features in our instantiation) and selects data by importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2 to 2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert-curated data across 8 target distributions.
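The selection procedure described in the abstract (estimate importance weights in an n-gram feature space, then resample according to them) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the bucket count `NUM_BUCKETS`, the CRC32 hash, the add-one smoothing, and the Gumbel top-k sampler are all assumptions chosen for illustration; the abstract only specifies that the features are n-grams and that data are selected by importance resampling.

```python
import zlib
import numpy as np

NUM_BUCKETS = 1024  # hypothetical size of the hashed feature space


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams of a whitespace-tokenized text in hashed buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for gram in ngrams:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_bucket_distribution(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    total = np.full(num_buckets, smoothing)  # assumed add-one smoothing
    for text in texts:
        total += hashed_ngram_counts(text, num_buckets)
    return total / total.sum()


def log_importance_weights(raw_texts, target_texts, num_buckets=NUM_BUCKETS):
    """Log of p_target(features) / p_raw(features) for each raw example."""
    log_ratio = (np.log(fit_bucket_distribution(target_texts, num_buckets))
                 - np.log(fit_bucket_distribution(raw_texts, num_buckets)))
    return np.array([hashed_ngram_counts(t, num_buckets) @ log_ratio
                     for t in raw_texts])


def resample_without_replacement(log_weights, k, seed=0):
    """Pick k indices with probability tied to the weights (Gumbel top-k trick)."""
    rng = np.random.default_rng(seed)
    perturbed = log_weights + rng.gumbel(size=len(log_weights))
    return np.argsort(-perturbed)[:k]


# Toy data: two prose-like and two code-like raw examples, one prose-like target.
raw_texts = [
    "the cat sat on the mat",
    "def foo ( x ) : return x",
    "the dog sat on the rug",
    "import numpy as np",
]
target_texts = ["the cat and the dog sat on the mat"]

log_w = log_importance_weights(raw_texts, target_texts)
selected = resample_without_replacement(log_w, k=2)
```

In this toy setup the prose-like raw examples receive higher log importance weights than the code-like ones, so they are more likely to be selected. Hashing n-grams into a fixed number of buckets keeps the feature space small regardless of vocabulary size, which is one way to make weight estimation tractable at scale.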