A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.