The rapid evolution of Large Language Models (LLMs) has underscored the importance of massive, heterogeneous, and high-quality data. A data recipe is a mixture of data from different sources for training LLMs, and it plays a vital role in LLM performance. Existing open-source tools for LLM data processing are mostly tailored to specific data recipes. To continuously uncover the potential of LLMs, incorporate data from new sources, and improve LLM performance, we build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes, explore different possibilities in forming data mixtures, and evaluate their effects on model performance. Unlike traditional data-analytics pipelines, Data-Juicer faces several unique challenges. First, the possible data sources for forming data recipes are truly heterogeneous and massive, and they vary in quality. Second, it is extremely expensive to precisely evaluate the impact of data recipes on LLM performance. Third, the end users of Data-Juicer, model developers, need sufficient flexibility to configure and evaluate different data recipes. Data-Juicer features a fine-grained abstraction of pipelines for constructing data recipes, with over 50 built-in operators for easy composition and extension. By incorporating visualization and auto-evaluation capabilities, Data-Juicer enables a timely feedback loop for both LLM pre-training and fine-tuning. Further, Data-Juicer is optimized and integrated with ecosystems for LLM training, evaluation, and distributed computing. The data recipes derived with Data-Juicer yield notable improvements on state-of-the-art LLMs, with up to a 7.45% increase in average score across 16 LLM benchmarks and a 17.5% higher win rate in pair-wise GPT-4 evaluations. Our system, data recipes, and tutorials are released, calling for broader data-centric research on training and understanding LLMs.
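To make the idea of a fine-grained pipeline of composable operators more concrete, the following is a minimal sketch in Python. It is not Data-Juicer's actual API: the class names `Mapper`, `Filter`, and `Deduplicator`, the function `run_pipeline`, and the `"text"` field are all hypothetical names chosen for illustration, and the operators are deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sample type: one record with a "text" field (an assumption).
Sample = Dict[str, str]


@dataclass
class Mapper:
    """Edits each sample in place of the original (e.g. text cleaning)."""
    fn: Callable[[Sample], Sample]

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [self.fn(s) for s in samples]


@dataclass
class Filter:
    """Keeps only the samples for which the predicate is true."""
    keep: Callable[[Sample], bool]

    def run(self, samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if self.keep(s)]


class Deduplicator:
    """Drops exact duplicates by text, keeping the first occurrence."""

    def run(self, samples: List[Sample]) -> List[Sample]:
        seen = set()
        unique = []
        for s in samples:
            if s["text"] not in seen:
                seen.add(s["text"])
                unique.append(s)
        return unique


def run_pipeline(samples: List[Sample], ops: list) -> List[Sample]:
    """Applies the operators in order; each consumes the previous output."""
    for op in ops:
        samples = op.run(samples)
    return samples


# An illustrative "recipe": normalize whitespace, drop short texts, dedupe.
recipe = [
    Mapper(lambda s: {**s, "text": " ".join(s["text"].split())}),
    Filter(lambda s: len(s["text"]) >= 10),
    Deduplicator(),
]

raw = [
    {"text": "Hello   world, this is fine."},
    {"text": "Hello world, this is fine."},
    {"text": "hi"},
]
cleaned = run_pipeline(raw, recipe)
```

Because each operator exposes the same `run` interface in this sketch, a different recipe is just a different list of operators, which is one plausible way to read the composition and extension described in the abstract.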