The coverage and composition of pretraining data significantly affect the generalization ability of Large Language Models (LLMs). Despite this importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of individual data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (the domain weights) in a principled way. Our approach is a two-stage process: (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; (ii) training a larger base model by sampling training domains according to the learned domain weights. Our experiments show that DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracy across $6$ tasks than baseline methods. Moreover, when the goal is to generalize to out-of-domain target tasks that are unseen in the pretraining corpus (OOD domains), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
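Stage (ii) above, sampling training domains according to learned domain weights, can be illustrated with a minimal sketch. The domain names, the weight values, and the helper `sample_domains` are hypothetical and chosen only for illustration. They are not taken from the paper, and the sketch does not cover the bi-level optimization of stage (i).

```python
import random
from collections import Counter


def sample_domains(domain_weights, num_samples, seed=0):
    """Draw domain names with probability proportional to their weights.

    `domain_weights` maps a domain name to a non-negative weight.
    The weights do not need to sum to 1; they are normalized implicitly.
    """
    names = list(domain_weights)
    weights = [domain_weights[name] for name in names]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.choices(names, weights=weights, k=num_samples)


# Hypothetical domain weights, standing in for the output of stage (i).
example_weights = {"web": 0.5, "code": 0.3, "books": 0.2}

# Each draw picks the domain that the next training sequence comes from.
draws = sample_domains(example_weights, num_samples=10_000)
counts = Counter(draws)
empirical = {name: counts[name] / len(draws) for name in example_weights}
print(empirical)
```

With enough draws, the empirical domain frequencies approach the normalized weights, which is the sense in which the weights control each domain's influence on the training mixture.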