To efficiently scale large model (LM) training, researchers are moving from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing work introduces in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods create severe resource competition between checkpointing and training: they work under DP but scale poorly under resource-intensive HP. To keep checkpointing overhead low for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing from two aspects: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage, which uses three-level asynchronous on-device scheduling to increase parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to improve checkpoint completeness during hardware failures. Unlike methods that require inter-node communication, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures at minimal cost. Building on these methods, this work enables fast restart of failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficient NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
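The core idea of overlapping snapshotting with training can be sketched as follows. This is a minimal, hedged illustration, not the paper's implementation: the `Snapshotter` and `train_step` names are hypothetical, a background thread stands in for the paper's three-level on-device scheduling, and `copy.deepcopy` stands in for a device-to-host parameter copy.

```python
import copy
import queue
import threading

class Snapshotter:
    """Illustrative asynchronous in-memory snapshotter (not the paper's API).

    Snapshot requests are queued and copied off the training critical path
    by a background worker, so training proceeds without blocking on the
    copy. A real system must additionally coordinate the copy with ongoing
    parameter updates, which is what hierarchical coordination addresses.
    """

    def __init__(self):
        self._tasks = queue.Queue()
        self._latest = None
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            step, params = self._tasks.get()
            snap = copy.deepcopy(params)  # stands in for device->host copy
            with self._lock:
                self._latest = (step, snap)
            self._tasks.task_done()

    def snapshot_async(self, step, params):
        # Returns immediately; the copy proceeds in the background.
        self._tasks.put((step, params))

    def latest(self):
        self._tasks.join()  # wait for in-flight snapshots to complete
        with self._lock:
            return self._latest

def train_step(params):
    # Toy stand-in for a training step: update each parameter in place.
    for k in params:
        params[k] += 1.0

params = {"w": 0.0, "b": 0.0}
snap = Snapshotter()
for step in range(3):
    train_step(params)
    snap.snapshot_async(step, params)

step, saved = snap.latest()
```

After the loop, `latest()` returns the most recent completed snapshot, which a restart path could load directly from host memory instead of reading from NFS.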