High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions and leaves them without principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure that pipelines are both executable and effective. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10% gain in downstream LLM performance compared with training on the original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.
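To make the pipeline-level feedback loop concrete, the following is a minimal illustrative sketch and not DataEvolver's actual implementation. It assumes a toy operator set (`strip_whitespace`, `drop_short`, `dedupe`), a toy distribution gap (the difference in mean word count between prepared data and reference examples), and a greedy refinement rule; all of these names and choices are hypothetical and were chosen only to show the shape of the loop.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical operator type: a function from rows to rows.
Operator = Callable[[List[str]], List[str]]


def strip_whitespace(rows: List[str]) -> List[str]:
    """Collapse runs of whitespace inside each row."""
    return [" ".join(r.split()) for r in rows]


def drop_short(rows: List[str]) -> List[str]:
    """Keep only rows with at least three words."""
    return [r for r in rows if len(r.split()) >= 3]


def dedupe(rows: List[str]) -> List[str]:
    """Remove exact duplicates while preserving order."""
    return list(dict.fromkeys(rows))


def gap(data: List[str], reference: List[str]) -> float:
    """Toy distribution gap (an assumption): difference in mean word count."""
    if not data:
        return float("inf")
    data_len = mean(len(r.split()) for r in data)
    ref_len = mean(len(r.split()) for r in reference)
    return abs(data_len - ref_len)


def run(pipeline: List[Operator], rows: List[str]) -> List[str]:
    """Apply the operators in order."""
    for op in pipeline:
        rows = op(rows)
    return rows


def refine(raw: List[str], reference: List[str], operators: List[Operator],
           max_rounds: int = 5) -> Tuple[List[Operator], float]:
    """Greedily append the operator that most reduces the gap; stop when none helps."""
    pipeline: List[Operator] = []
    best = gap(raw, reference)
    for _ in range(max_rounds):
        candidates = [(gap(run(pipeline + [op], raw), reference), op)
                      for op in operators if op not in pipeline]
        if not candidates:
            break
        score, op = min(candidates, key=lambda c: c[0])
        if score >= best:  # feedback signal: no candidate narrows the gap
            break
        pipeline.append(op)
        best = score
    return pipeline, best


# Made-up example data, for illustration only.
raw = ["good clean sentence here", "good clean sentence here", "bad",
       "  another   fine example row  "]
reference = ["a tidy reference sentence", "another tidy reference line"]

pipeline, best = refine(raw, reference, [strip_whitespace, drop_short, dedupe])
print([op.__name__ for op in pipeline], best)
```

In this toy setting the loop stops as soon as no remaining operator narrows the gap. A realistic system would need a richer distance between data distributions and would also have to check operator dependencies before extending the plan, as the operator-level step in the abstract describes.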