In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe (Stacked Parallelism with Adaptive Reordering), a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while keeping computation overhead nearly constant at only 2-3x, even under high redundancy, where traditional replication would incur linearly growing overhead. We derive closed-form expressions for the endurable failure count and the computation overhead, validate them with SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale, with up to 600k GPUs, SPARe reduces time-to-train by 40-50% compared to traditional replication.
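To make the restart-dominant regime concrete, the following minimal sketch uses the classical Young/Daly first-order checkpointing approximation. It is not SPARe's closed-form model, which the abstract does not state. The function names and all numeric inputs (per-GPU MTBF, checkpoint cost, restart cost) are hypothetical values chosen for illustration, and failures are assumed to be independent and exponentially distributed.

```python
import math


def system_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Job-level MTBF, assuming independent, exponentially distributed GPU failures."""
    return per_gpu_mtbf_hours / num_gpus


def young_daly_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Classical first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def wasted_fraction(interval_hours: float, checkpoint_cost_hours: float,
                    restart_cost_hours: float, mtbf_hours: float) -> float:
    """First-order fraction of wall-clock time lost to checkpointing,
    recomputation (half an interval on average), and restart per failure.
    Only meaningful while the result stays well below 1."""
    checkpoint_overhead = checkpoint_cost_hours / interval_hours
    failure_overhead = (interval_hours / 2.0 + restart_cost_hours) / mtbf_hours
    return checkpoint_overhead + failure_overhead


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the paper.
    per_gpu_mtbf = 50_000.0   # hours between failures for a single GPU
    checkpoint_cost = 1 / 60  # one minute to write a checkpoint
    restart_cost = 10 / 60    # ten minutes to restart the job

    for gpus in (1_000, 100_000, 600_000):
        mtbf = system_mtbf_hours(per_gpu_mtbf, gpus)
        interval = young_daly_interval_hours(checkpoint_cost, mtbf)
        waste = wasted_fraction(interval, checkpoint_cost, restart_cost, mtbf)
        print(f"{gpus:>7} GPUs: MTBF={mtbf:.3f} h, "
              f"interval={interval:.3f} h, waste~{waste:.2f}")
```

Under these assumptions, the job-level MTBF shrinks in inverse proportion to the GPU count, while the restart term `restart_cost_hours / mtbf_hours` does not depend on the checkpoint interval, so more frequent checkpointing cannot reduce it. This is one way to read the abstract's motivation for masking failures rather than restarting after each one.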