Modern frameworks for training large foundation models (LFMs) employ dataloaders in a data-parallel manner, with each loader processing a disjoint subset of the training data. When preparing training data that originates from multiple, distinct sources, two fundamental challenges arise. First, due to the quadratic computational complexity of the attention operator, the non-uniform distribution of sample lengths across data-parallel ranks leads to significant workload imbalance among dataloaders, degrading training efficiency. Second, supporting diverse data sources requires per-dataset file access states that are redundantly replicated across parallel loaders, consuming excessive memory; this replication also hinders dynamic data mixing (e.g., curriculum learning) and incurs redundant access and memory overhead under hybrid parallelism. We present MegaScale-Data, an industrial-grade distributed data loading architecture for multisource LFM training, with three key innovations: (1) disaggregated data preprocessing via role-specific actors (Source Loaders/Data Constructors), which eliminates redundant data access across sources and parallel ranks and ensures multisource scalability; (2) a centralized, declarative data plane for load-time multisource orchestration, covering long-short context mixing, multimodality, and curriculum learning; (3) a multi-level auto-partitioning and scaling mechanism for source loaders under heterogeneous preprocessing costs. We also report our designs and operational experience in deployment and fault tolerance. MegaScale-Data achieves up to a 4.5x improvement in end-to-end training throughput and up to a 13.5x reduction in CPU memory usage.
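To make the first challenge concrete, the minimal sketch below (our own illustrative example, not MegaScale-Data code) approximates per-rank attention cost as the sum of squared sequence lengths; the rank IDs and sequence lengths are hypothetical. It shows that two data-parallel ranks holding identical token counts can still differ by over 2x in attention work when length distributions are skewed, forcing the lighter rank to idle at every synchronization point.

```python
# Illustrative sketch: why non-uniform sequence lengths across
# data-parallel ranks imbalance attention workload. Attention cost
# per sample scales roughly with seq_len**2, so a rank holding
# longer samples finishes each step later and stalls the others.

def attention_cost(seq_lens):
    """Approximate per-rank attention FLOPs, up to a constant factor."""
    return sum(l * l for l in seq_lens)

# Hypothetical micro-batches on two data-parallel ranks: both hold
# the same token budget (8192 tokens), but rank 1 got one long sample.
rank0 = [2048, 2048, 2048, 2048]  # uniform lengths, 8192 tokens
rank1 = [6144, 1024, 512, 512]    # skewed lengths, 8192 tokens

c0, c1 = attention_cost(rank0), attention_cost(rank1)
print(f"rank0 cost: {c0:,}")         # 16,777,216
print(f"rank1 cost: {c1:,}")         # 39,321,600
print(f"imbalance: {c1 / c0:.2f}x")  # ~2.34x -> rank0 idles each step
```

Equalizing token counts alone is therefore insufficient; a balanced loader must account for the quadratic per-sample cost when assigning samples to ranks.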