Data plays a fundamental role in the training of Large Language Models (LLMs). Effective data management, particularly the construction of a well-suited training dataset, is important for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning phases. Despite this importance, the research community still lacks a systematic analysis of the rationale behind management strategy selection, its consequential effects, methodologies for evaluating curated datasets, and the ongoing pursuit of improved strategies. As a result, data management has attracted increasing attention among researchers. This survey provides a comprehensive overview of current research on data management in both the pretraining and supervised fine-tuning stages of LLMs, covering several noteworthy aspects of data management strategy design, including data quantity, data quality, and domain/task composition. Looking toward the future, we identify existing challenges and outline promising directions for development in this field. This survey thus serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.