As large language models scale, training them requires thousands of GPUs over extended durations, making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Because their training state is substantially larger, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEtion, a distributed, in-memory checkpointing system tailored to MoE models. MoEtion is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of activations and gradients at pipeline-stage boundaries, enabling localized recovery without re-executing unaffected workers. Evaluations across diverse MoE models with up to 64 experts show that MoEtion reduces checkpointing overhead by up to \(4\times\) and recovery overhead by up to \(31\times\) compared to state-of-the-art approaches, sustaining consistently high Effective Training Time Ratios (ETTR) of up to \(0.94\) even under frequent failures (MTBF as low as 10 minutes) and delivering up to \(8\times\) overall training speedup, all without compromising synchronous training semantics. Overall, MoEtion offers a robust and scalable fault-tolerance solution for the next generation of sparsely activated models.
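To make the sparse-checkpointing idea concrete, the following is a minimal, hypothetical sketch: each iteration snapshots only a rotating subset of experts, and a sparse-to-dense pass assembles the most recent snapshot of every expert into one dense checkpoint. All names here (`SparseCheckpointer`, `snapshot`, `to_dense`) are illustrative assumptions, not MoEtion's actual API, and this toy version omits the replay bookkeeping that the paper's upstream logging provides to keep the reconstructed checkpoint consistent.

```python
import copy

class SparseCheckpointer:
    """Toy in-memory sparse checkpointer: snapshots a rotating expert subset."""

    def __init__(self, num_experts, experts_per_step):
        self.num_experts = num_experts
        self.experts_per_step = experts_per_step
        self.latest = {}   # expert id -> (iteration, snapshotted state)
        self.cursor = 0    # next expert in the round-robin rotation

    def snapshot(self, iteration, expert_states):
        """Snapshot the next rotating subset of experts for this iteration."""
        for _ in range(self.experts_per_step):
            eid = self.cursor % self.num_experts
            self.latest[eid] = (iteration, copy.deepcopy(expert_states[eid]))
            self.cursor += 1

    def to_dense(self):
        """Assemble a dense checkpoint once every expert has been covered."""
        if len(self.latest) < self.num_experts:
            return None  # some experts have never been snapshotted yet
        return {eid: state for eid, (_it, state) in sorted(self.latest.items())}

# Usage (toy): 8 experts, 2 snapshotted per iteration -> full coverage in 4 steps.
ckpt = SparseCheckpointer(num_experts=8, experts_per_step=2)
for it in range(4):
    states = {eid: {"w": eid + it} for eid in range(8)}  # stand-in expert weights
    ckpt.snapshot(it, states)
dense = ckpt.to_dense()
assert dense is not None and len(dense) == 8
```

The design intuition the abstract describes is visible even in this sketch: per-iteration checkpoint cost scales with `experts_per_step` rather than with the full (much larger) MoE training state, at the price of a reconstruction step before a dense checkpoint is available.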