Data-Juicer: A One-Stop Data Processing System for Large Language Models

The rapid evolution of Large Language Models (LLMs) has underscored the importance of massive, diverse, and high-quality data. Yet existing open-source tools for LLM data processing remain limited and are mostly tailored to specific datasets, emphasizing the reproducibility of released data over adaptability and usability, which inhibits potential applications. In response, we propose Data-Juicer, a one-stop LLM data processing system that is powerful yet flexible and user-friendly. Our system offers over 50 built-in versatile operators and pluggable tools that combine modularity, composability, and extensibility to serve diverse LLM data processing needs. By incorporating visualized and automatic evaluation capabilities, Data-Juicer enables a timely feedback loop that accelerates data processing and yields data insights. To enhance usability, Data-Juicer provides out-of-the-box components for users with various backgrounds, along with a rich set of data recipes for LLM pre-training and post-tuning. Further, we employ multi-faceted system optimization and seamlessly integrate Data-Juicer with both LLM and distributed computing ecosystems to enable efficient and scalable data processing. Empirical validation of the generated data recipes reveals considerable improvements in LLaMA performance across various pre-training and post-tuning cases, with up to a 7.45% relative improvement in averaged score across 16 LLM benchmarks and a 16.25% higher win rate in pair-wise GPT-4 evaluation. The system's efficiency and scalability are also validated, with up to an 88.7% reduction in single-machine processing time, 77.1% and 73.1% less memory and CPU usage respectively, and a 7.91x processing speedup when utilizing distributed computing ecosystems. Our system, data recipes, and multiple tutorial demos are released, calling for broader research centered on LLM data.
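
To make the idea of modular, composable operators concrete, the following is a minimal sketch of how small mapper and filter operators might be chained into a processing recipe. It is an illustration only and does not use Data-Juicer's actual API: the names `normalize_whitespace`, `min_words_filter`, and `run_pipeline`, as well as the assumption that each sample is a dict with a `"text"` field, are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: each sample is a plain dict carrying its content under "text".
Sample = Dict[str, str]
Mapper = Callable[[Sample], Sample]   # transforms a sample
Filter = Callable[[Sample], bool]     # returns True to keep a sample


def normalize_whitespace(sample: Sample) -> Sample:
    """Hypothetical mapper: collapse runs of whitespace into single spaces."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_words_filter(min_words: int) -> Filter:
    """Hypothetical filter factory: keep samples with at least min_words words."""
    def keep(sample: Sample) -> bool:
        return len(sample["text"].split()) >= min_words
    return keep


def run_pipeline(
    samples: Iterable[Sample],
    mappers: List[Mapper],
    filters: List[Filter],
) -> List[Sample]:
    """Apply mappers, then filters, then exact-match deduplication on text."""
    seen = set()
    kept = []
    for sample in samples:
        for mapper in mappers:
            sample = mapper(sample)
        if not all(f(sample) for f in filters):
            continue  # dropped by at least one filter
        if sample["text"] in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(sample["text"])
        kept.append(sample)
    return kept


raw = [
    {"text": "hello   world  again"},
    {"text": "hello world again"},   # duplicate after normalization
    {"text": "hi"},                  # too short
]
cleaned = run_pipeline(raw, [normalize_whitespace], [min_words_filter(3)])
```

Because each operator is an ordinary function with a fixed signature, a recipe in this sketch is just a list of operators, and adding a new operator does not require changing the pipeline itself.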
