As large language models scale, training them requires thousands of GPUs over extended durations, making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Because their training state is substantially larger, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEtion, a distributed, in-memory checkpointing system tailored to MoE models. MoEtion is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of activations and gradients at pipeline-stage boundaries, enabling localized recovery without re-executing unaffected workers. Evaluations across diverse MoE models with up to 64 experts show that MoEtion reduces checkpointing overhead by up to \(4\times\) and recovery overhead by up to \(31\times\) compared to state-of-the-art approaches, sustaining consistently high Effective Training Time Ratios (ETTR) of up to \(0.94\) even under frequent failures (MTBF as low as 10 minutes) and delivering up to \(8\times\) overall training speedup, all without compromising synchronous training semantics. Overall, MoEtion offers a robust and scalable fault-tolerance solution for the next generation of sparsely activated models.
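To make the sparse-checkpointing idea concrete, the following is a minimal, hypothetical sketch: each iteration snapshots only a rotating subset of experts, and a sparse-to-dense pass assembles the most recent snapshot of every expert into one dense checkpoint. All names here (`SparseCheckpointer`, `snapshot`, `to_dense`) are illustrative assumptions, not MoEtion's actual API, and this toy version omits the replay bookkeeping that the paper's upstream logging provides to keep the reconstructed checkpoint consistent.

```python
import copy

class SparseCheckpointer:
    """Toy in-memory sparse checkpointer: snapshots a rotating expert subset."""

    def __init__(self, num_experts, experts_per_step):
        self.num_experts = num_experts
        self.experts_per_step = experts_per_step
        self.latest = {}   # expert id -> (iteration, snapshotted state)
        self.cursor = 0    # next expert in the round-robin rotation

    def snapshot(self, iteration, expert_states):
        """Snapshot the next rotating subset of experts for this iteration."""
        for _ in range(self.experts_per_step):
            eid = self.cursor % self.num_experts
            self.latest[eid] = (iteration, copy.deepcopy(expert_states[eid]))
            self.cursor += 1

    def to_dense(self):
        """Assemble a dense checkpoint once every expert has been covered."""
        if len(self.latest) < self.num_experts:
            return None  # some experts have never been snapshotted yet
        return {eid: state for eid, (_it, state) in sorted(self.latest.items())}

# Usage (toy): 8 experts, 2 snapshotted per iteration -> full coverage in 4 steps.
ckpt = SparseCheckpointer(num_experts=8, experts_per_step=2)
for it in range(4):
    states = {eid: {"w": eid + it} for eid in range(8)}  # stand-in expert weights
    ckpt.snapshot(it, states)
dense = ckpt.to_dense()
assert dense is not None and len(dense) == 8
```

The design intuition the abstract describes is visible even in this sketch: per-iteration checkpoint cost scales with `experts_per_step` rather than with the full (much larger) MoE training state, at the price of a reconstruction step before a dense checkpoint is available.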