Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either target specific parallelism schemes or risk drifting away from the failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring that per-iteration gradients remain stochastically equivalent to those of a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that keep faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks on up to 512 GPUs. ReCoVer preserves the training trajectory of a failure-free reference despite losing 256 GPUs over the course of the run. Compared with checkpoint-and-restart baselines, ReCoVer achieves $2.23\times$ higher effective throughput after successive failures. As a result, ReCoVer processes 74.9% more tokens at 234 GPU-hours, and the gap widens as training continues.
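The constant-microbatch invariant can be illustrated with a minimal sketch. The abstract does not specify the redistribution policy, so the near-even split below is an assumption, and the function name `redistribute_quotas` and the integer replica IDs are hypothetical. The only property taken from the abstract is that the per-iteration microbatch total stays fixed when replicas are lost.

```python
from typing import Dict, List


def redistribute_quotas(total_microbatches: int, survivors: List[int]) -> Dict[int, int]:
    """Split a fixed per-iteration microbatch count across surviving replicas.

    Hypothetical policy (an assumption, not the system's actual one): a
    near-even split in which the lowest-numbered survivors absorb the remainder.
    """
    if not survivors:
        raise ValueError("no surviving replicas to assign microbatches to")
    base, extra = divmod(total_microbatches, len(survivors))
    ordered = sorted(survivors)
    # The first `extra` survivors each take one additional microbatch.
    return {r: base + (1 if i < extra else 0) for i, r in enumerate(ordered)}


# Eight data-parallel replicas share 32 microbatches per iteration.
quotas_before = redistribute_quotas(32, [0, 1, 2, 3, 4, 5, 6, 7])

# Replicas 3 and 6 fail; the six survivors absorb their share.
quotas_after = redistribute_quotas(32, [0, 1, 2, 4, 5, 7])

# The invariant: the per-iteration total is unchanged by the failures.
assert sum(quotas_before.values()) == sum(quotas_after.values()) == 32
```

Under this sketch, each survivor's quota grows while the global batch contributing to each gradient step keeps the same size, which is the condition the abstract ties to staying on the failure-free trajectory.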