Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing

To scale large model (LM) training efficiently, researchers are moving from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods cause severe resource competition between checkpointing and training, which is tolerable under DP but scales poorly under resource-intensive HP. To keep checkpointing overhead low for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to increase the parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to preserve checkpoint completeness during hardware failures. Unlike methods that require inter-node communication, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. Building on these methods, this work enables fast restart of failed HP training through Distributed In-memory Checkpoint Loading, which avoids the inefficiency of NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
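
The general idea of keeping redundant in-memory snapshots and restarting from a surviving copy can be illustrated with a deliberately simplified sketch. This is not the system described above: the class `AsyncInMemoryCheckpointer`, its methods, and the `train` loop are hypothetical names chosen for illustration, plain Python objects stand in for model state, and a background thread stands in for asynchronous scheduling. In this toy version the staging copy is taken in the caller, and only the replication into redundant copies is overlapped with training.

```python
import copy
import threading


class AsyncInMemoryCheckpointer:
    """Toy in-memory checkpointer with redundant copies (illustrative only)."""

    def __init__(self, num_replicas=2):
        # Each slot stands in for one redundant in-memory copy on the same node.
        self.num_replicas = num_replicas
        self._replicas = [None] * num_replicas
        self._lock = threading.Lock()
        self._worker = None

    def save_async(self, step, state):
        # Allow at most one snapshot in flight.
        self.wait()
        # Stage a private copy so training can keep mutating the live state.
        staged = copy.deepcopy(state)
        # Replicate in the background while the caller continues training.
        self._worker = threading.Thread(target=self._commit, args=(step, staged))
        self._worker.start()

    def _commit(self, step, staged):
        for i in range(self.num_replicas):
            replica = {"step": step, "state": copy.deepcopy(staged)}
            with self._lock:
                self._replicas[i] = replica

    def wait(self):
        # Block until the in-flight snapshot, if any, has been committed.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def drop_replica(self, index):
        # Simulate losing one redundant copy, e.g. after a partial failure.
        self.wait()
        with self._lock:
            self._replicas[index] = None

    def load_latest(self):
        # Recover from whichever surviving replica holds the newest step.
        self.wait()
        with self._lock:
            alive = [r for r in self._replicas if r is not None]
        if not alive:
            raise RuntimeError("no in-memory checkpoint available")
        best = max(alive, key=lambda r: r["step"])
        return best["step"], copy.deepcopy(best["state"])


def train(num_steps, checkpointer, every=2):
    """Stand-in training loop: each step adds 1.0 to every weight."""
    weights = [0.0, 0.0]
    for step in range(1, num_steps + 1):
        weights = [w + 1.0 for w in weights]
        if step % every == 0:
            checkpointer.save_async(step, {"weights": weights})
    return weights
```

Calling `train(5, ckpt)` on a checkpointer with two replicas, then `ckpt.drop_replica(0)`, still lets `ckpt.load_latest()` return the snapshot taken at the last checkpointed step, because the second copy survives. A real system would replace the deep copies with device-to-host transfers and would place the redundant copies so that they survive the failures it targets.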
