Large Language Models (LLMs) have transformed many fields, yet their training efficiency depends heavily on effective data curation. While data selection has been widely studied, strategic data organization for training remains underexplored, particularly given that current LLMs are often trained for only one or a few epochs. This paper systematically explores how data organization influences LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, which incurs minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by these, we introduce two novel data ordering methods, STR and SAW. Extensive experiments across different model scales and data sizes, covering both the pre-training and SFT stages, validate the effectiveness of these guidelines. They also demonstrate that the proposed data ordering methods robustly improve the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/