High-quality training data is essential to large language models (LLMs), but producing it typically requires extensive, costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions, and they lack principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure that pipelines are both executable and effective. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between the prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10% gain in downstream LLM performance compared with training on the original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.
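To make the two levels concrete, the following is a minimal sketch and not DataEvolver's implementation. It assumes a toy setting in which records are strings, the "distribution gap" is the absolute difference in mean record length, and the feedback loop is a simple greedy search. The names `Operator`, `build_plan`, `gap`, and `refine` are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator: a named transform plus the operators it depends on."""
    name: str
    fn: Callable[[list[str]], list[str]]
    requires: frozenset[str] = frozenset()


def build_plan(operators: list[Operator]) -> list[Operator]:
    """Operator level: order operators so each runs after its dependencies."""
    plan, placed, pending = [], set(), list(operators)
    while pending:
        ready = [op for op in pending if op.requires <= placed]
        if not ready:
            raise ValueError("unresolvable dependency conflict")
        op = ready[0]
        plan.append(op)
        placed.add(op.name)
        pending.remove(op)
    return plan


def run(plan: list[Operator], data: list[str]) -> list[str]:
    """Execute the plan's operators in order."""
    for op in plan:
        data = op.fn(data)
    return data


def gap(data: list[str], reference: list[str]) -> float:
    """Toy stand-in for a distribution gap (assumption): mean-length difference."""
    if not data:
        return float("inf")
    return abs(mean(len(r) for r in data) - mean(len(r) for r in reference))


def refine(candidates: list[Operator], raw: list[str], reference: list[str]):
    """Pipeline level: greedily add the operator that most narrows the gap."""
    selected: list[Operator] = []
    best_gap = gap(raw, reference)
    while True:
        best_plan = None
        for op in candidates:
            if op in selected:
                continue
            try:
                plan = build_plan(selected + [op])
            except ValueError:
                continue  # dependencies not yet in the operator set
            g = gap(run(plan, raw), reference)
            if g < best_gap:
                best_gap, best_plan = g, plan
        if best_plan is None:
            return selected, best_gap  # no remaining operator improves the gap
        selected = best_plan


# Illustrative operators and data (all hypothetical).
strip = Operator("strip", lambda d: [r.strip() for r in d])
drop_short = Operator("drop_short", lambda d: [r for r in d if len(r) >= 5],
                      frozenset({"strip"}))
dedup = Operator("dedup", lambda d: list(dict.fromkeys(d)), frozenset({"strip"}))

raw = ["      hello world        ", "hello world", "ok", "clean text          "]
reference = ["hello world", "clean text"]

plan, final_gap = refine([dedup, drop_short, strip], raw, reference)
print([op.name for op in plan], final_gap)
```

The greedy loop only accepts an operator when the measured gap shrinks, which mirrors the idea of feedback from high-quality examples; a real system would need a far richer gap measure and search strategy than this sketch assumes.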