The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or on filtering by hand-crafted heuristics. An approach that is ultimately more scalable (let alone more satisfying) is to \emph{learn} which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed \emph{DataRater} is an instance of this idea. It estimates the value of training on any particular data point, which it learns via meta-learning using `meta-gradients', with the objective of improving training efficiency on held-out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
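To make the bilevel idea concrete, here is a minimal sketch (not the paper's implementation) of meta-gradient data valuation on a toy 1-D regression task. A rater assigns a score to each training example; an inner SGD step trains the model on the softmax-weighted loss; the outer (meta) objective is the loss on held-out data, and the rater's scores are updated to reduce it. All names and the finite-difference meta-gradient are illustrative assumptions, a stand-in for differentiating through the inner update.

```python
import numpy as np

# Toy task: true relation is y = 2x. Two clean training points and one
# point with a corrupted label; the held-out set is clean.
train_x = np.array([1.0, 2.0, 1.5])
train_y = np.array([2.0, 4.0, -3.0])   # third label is corrupted
heldout_x = np.array([3.0, 0.5])
heldout_y = 2.0 * heldout_x

def inner_update(w, scores, lr=0.05):
    """One weighted SGD step; per-example weights come from the rater scores."""
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over scores
    grads = 2 * (w * train_x - train_y) * train_x    # d/dw of squared error
    return w - lr * np.dot(weights, grads)

def heldout_loss(w):
    return np.mean((w * heldout_x - heldout_y) ** 2)

def meta_objective(scores, w=0.0):
    # Outer objective: held-out loss after one inner training step.
    return heldout_loss(inner_update(w, scores))

# Meta-gradient via central finite differences (a simple stand-in for
# backpropagating through the inner update).
scores = np.zeros(3)
eps = 1e-4
for _ in range(200):
    g = np.zeros_like(scores)
    for i in range(len(scores)):
        e = np.zeros_like(scores)
        e[i] = eps
        g[i] = (meta_objective(scores + e) - meta_objective(scores - e)) / (2 * eps)
    scores -= 1.0 * g  # meta step on the rater's scores

print(scores)  # the corrupted example should end up with the lowest score
```

In this sketch the rater learns to down-weight the mislabeled point because training on it increases the held-out loss, which is the essence of valuing data by its effect on held-out training efficiency.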