Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given unlabeled target samples. Because of the scale and dimensionality of raw text data, existing methods use simple heuristics or require human experts to curate data manually. Instead, we extend the classic importance resampling approach, used in low dimensions, to LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target in some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features correlates strongly with average downstream accuracy (r=0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (the target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. Code is available at https://github.com/p-lambda/dsir.
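To make the pipeline described above concrete, the following is a minimal sketch of importance resampling over hashed n-gram features. It is not the released DSIR implementation. Everything specific in it is an assumption made for illustration: the function names, the use of unigrams and bigrams, `zlib.crc32` as the hash, the bucket count, a smoothed bag-of-buckets model for each distribution, and Gumbel top-k as the way to sample without replacement in proportion to the weights.

```python
import heapq
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 1024  # assumption: illustrative size, not a value from the text


def hashed_ngram_counts(text, num_buckets=NUM_BUCKETS):
    """Hash the unigrams and bigrams of `text` into a fixed number of buckets."""
    words = text.lower().split()
    ngrams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    counts = Counter()
    for gram in ngrams:
        counts[zlib.crc32(gram.encode("utf-8")) % num_buckets] += 1
    return counts


def fit_log_probs(docs, num_buckets=NUM_BUCKETS, smoothing=0.01):
    """Fit a smoothed categorical distribution over buckets; return log-probs."""
    totals = [smoothing] * num_buckets
    for doc in docs:
        for bucket, count in hashed_ngram_counts(doc, num_buckets).items():
            totals[bucket] += count
    norm = sum(totals)
    return [math.log(total / norm) for total in totals]


def log_importance_weight(doc, target_lp, raw_lp, num_buckets=NUM_BUCKETS):
    """Log of p_target(features) / p_raw(features) under the bag-of-buckets model."""
    counts = hashed_ngram_counts(doc, num_buckets)
    return sum(n * (target_lp[b] - raw_lp[b]) for b, n in counts.items())


def select_indices(raw_docs, target_docs, k, seed=0):
    """Sample k raw documents without replacement, proportional to their weights."""
    rng = random.Random(seed)
    target_lp = fit_log_probs(target_docs)
    raw_lp = fit_log_probs(raw_docs)
    keyed = []
    for index, doc in enumerate(raw_docs):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        score = log_importance_weight(doc, target_lp, raw_lp) + gumbel
        keyed.append((score, index))
    # Keeping the k largest perturbed log-weights is the Gumbel top-k trick.
    return [index for _, index in heapq.nlargest(k, keyed)]


# Toy data (hypothetical): the target is prose, the raw pool mixes prose and code.
target_docs = ["the cat sat on the mat", "the dog sat on the rug"]
raw_docs = [
    "the cat sat on the rug",
    "the dog sat on the mat",
    "def foo ( x ) : return x + 1",
    "for i in range ( n ) : print ( i )",
]
print(select_indices(raw_docs, target_docs, k=2))
```

In this toy pool the prose documents receive far larger log-weights than the code documents, so they dominate the selection. Working in log space and perturbing with Gumbel noise avoids normalizing the weights over the whole raw dataset, which is one reason this style of sampling is convenient at scale.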