The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even those it downweights. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
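As a rough illustration of the two steps described above (reweighting domains, then resampling), the following minimal sketch may help. It makes several assumptions that the abstract does not state: the Group DRO step is taken to be an exponentiated-gradient update driven by a per-domain "excess loss" (proxy loss minus a reference loss, clipped at zero), with a small uniform smoothing term. The function names, the `step_size` and `smoothing` values, and the toy loss numbers are all hypothetical. No model training is shown.

```python
import math
import random


def update_domain_weights(weights, excess_losses, step_size=1.0, smoothing=1e-3):
    """One multiplicative update that shifts weight toward high-loss domains.

    Assumed form of the Group DRO step; not taken from the abstract.
    """
    # Clip at zero so a domain with negative excess loss is treated as zero.
    scaled = {
        d: w * math.exp(step_size * max(excess_losses[d], 0.0))
        for d, w in weights.items()
    }
    total = sum(scaled.values())
    uniform = 1.0 / len(weights)
    # Renormalize, then mix with the uniform distribution so that
    # no domain weight can collapse to exactly zero.
    return {
        d: (1.0 - smoothing) * s / total + smoothing * uniform
        for d, s in scaled.items()
    }


def resample_by_domain(examples_by_domain, weights, num_samples, seed=0):
    """Draw (domain, example) pairs with domain frequencies following `weights`."""
    rng = random.Random(seed)
    domains = list(examples_by_domain)
    probs = [weights[d] for d in domains]
    picked = rng.choices(domains, weights=probs, k=num_samples)
    return [(d, rng.choice(examples_by_domain[d])) for d in picked]


# Toy inputs (hypothetical numbers, for illustration only).
weights = {"wikipedia": 1 / 3, "books": 1 / 3, "web": 1 / 3}
excess = {"wikipedia": 0.05, "books": 0.40, "web": -0.10}
new_weights = update_domain_weights(weights, excess)

corpus = {
    "wikipedia": ["w1", "w2"],
    "books": ["b1", "b2", "b3"],
    "web": ["x1"],
}
sample = resample_by_domain(corpus, new_weights, num_samples=20)
```

In this toy run the "books" domain, which has the largest excess loss, receives the largest weight after the update, and the resampled set is drawn accordingly. A full procedure would repeat the update throughout proxy training and aggregate the weights over steps; that loop is omitted here.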