Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we fine-tune a small data influence model to approximate oracle data preference signals collected by locally probing the pretraining model, and to select data accordingly for the next pretraining stage. Experiments on Pythia and the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks in both zero- and few-shot settings. It doubles the gains achieved by recent data selection approaches that leverage larger reference models and halves the total FLOPs required to reach certain performance levels. Further analysis validates the ever-changing data preferences of pretraining models and the effectiveness of our data influence models in capturing them. Our code is open-sourced at https://github.com/cxcscmu/MATES.
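The iterative loop described above — probe the pretraining model for oracle influence signals on a small sample, fit a small influence model to those signals, then select the highest-scoring data for the next stage — can be sketched in miniature. This is a hedged toy illustration, not the paper's implementation: `oracle_influence` stands in for the local probe (held-out loss change after a one-step update), the per-feature least-squares fit stands in for fine-tuning the small data influence model, and the drifting `model_state` mimics the evolving data preferences of the pretraining model.

```python
import random

random.seed(0)

def oracle_influence(example, model_state):
    # Stand-in for locally probing the pretraining model; the real oracle
    # measures the held-out loss change after a one-step update on `example`.
    return sum(f * w for f, w in zip(example, model_state))

def fit_influence_model(examples, targets):
    # Toy per-feature least-squares fit standing in for fine-tuning a small
    # data influence model to regress the oracle preference signals.
    dims = len(examples[0])
    weights = []
    for d in range(dims):
        xs = [e[d] for e in examples]
        num = sum(x * t for x, t in zip(xs, targets))
        den = sum(x * x for x in xs) or 1.0
        weights.append(num / den)
    return weights

def select_top_k(pool, weights, k):
    # Score the whole pool with the cheap influence model; keep the top k.
    score = lambda e: sum(f * w for f, w in zip(e, weights))
    return sorted(pool, key=score, reverse=True)[:k]

# MATES-style staged loop (toy): each stage, probe a small sample with the
# oracle, refit the influence model, and select data for the next stage.
pool = [[random.random() for _ in range(3)] for _ in range(200)]
model_state = [1.0, 0.5, -0.2]  # toy stand-in for evolving preferences
for stage in range(3):
    probe = random.sample(pool, 20)
    targets = [oracle_influence(e, model_state) for e in probe]
    w = fit_influence_model(probe, targets)
    batch = select_top_k(pool, w, k=32)
    # "Pretrain" on `batch`: here we only drift the preferences to emulate
    # the model-aware, non-static aspect of the method.
    model_state = [s + 0.1 * (stage + 1) for s in model_state]

print(len(batch))
```

The key design point the sketch preserves is that selection is re-done at every stage against a freshly refit influence model, rather than once statically before pretraining begins.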