Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Because raw text data is large in scale and high-dimensional, existing methods either rely on simple heuristics or require experts to manually curate data. Instead, we extend the classic importance resampling approach, typically used in low dimensions, to LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and then selects data by importance resampling according to these weights. To determine an appropriate feature space, we show that KL reduction, a data metric that measures the proximity between the selected pretraining data and the target in a feature space, correlates highly with average downstream accuracy (r=0.89) when computed with simple n-gram features. This motivates our instantiation of DSIR using n-gram features. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (where the target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.
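The selection procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the hashed unigram and bigram features, the bucket count, the add-one smoothing, the bag-of-n-grams likelihood, and the Gumbel top-k trick for sampling without replacement are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import zlib

import numpy as np

NUM_BUCKETS = 1024  # assumed size of the reduced feature space


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams of `text`, hashed into a fixed number of buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for gram in ngrams:
        # crc32 is a stable hash, so features are reproducible across runs.
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_bucket_distribution(texts, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Estimate a smoothed categorical distribution over hash buckets."""
    total = np.full(num_buckets, smoothing)
    for text in texts:
        total += hashed_ngram_counts(text, num_buckets)
    return total / total.sum()


def log_importance_weights(raw_texts, target_texts, num_buckets=NUM_BUCKETS):
    """Return log(p_target(z) / p_raw(z)) for each raw example's features z."""
    log_ratio = np.log(fit_bucket_distribution(target_texts, num_buckets)) - np.log(
        fit_bucket_distribution(raw_texts, num_buckets)
    )
    # Bag-of-n-grams model (assumption): log-likelihood is counts dot log-probabilities.
    return np.array(
        [hashed_ngram_counts(text, num_buckets) @ log_ratio for text in raw_texts]
    )


def importance_resample(log_weights, k, seed=0):
    """Sample k distinct indices with probability driven by the importance weights."""
    rng = np.random.default_rng(seed)
    # Gumbel top-k: perturb log-weights with Gumbel noise and keep the k largest.
    noisy = np.asarray(log_weights) + rng.gumbel(size=len(log_weights))
    return np.argsort(-noisy)[:k]


# Toy data, invented purely for illustration.
raw_pool = [
    "def foo return x",
    "the cat sat on the mat",
    "import numpy as np",
    "a lovely day in the park",
]
target_samples = ["def bar return y", "import os"]

weights = log_importance_weights(raw_pool, target_samples)
selected = importance_resample(weights, k=2)
```

Working in log space avoids underflow for long documents, and the added Gumbel noise keeps the selection stochastic instead of simply taking the top-k examples by weight.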