Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At this scale, hardware failures on individual devices cause performance skew across devices and reduce overall training efficiency. Existing resilient training systems overlook both sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations within hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To address these problems, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector that accurately identifies failures: it employs a workload-aware execution time predictor that separates failures from iteration time fluctuations while remaining lightweight enough for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments on a 256-GPU cluster show that ResiHP improves training throughput by 1.04$\times$ to 4.39$\times$ over state-of-the-art resilient training systems under diverse failure scenarios.
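To make the idea of workload-aware fail-slow detection concrete, the following is a minimal sketch and not ResiHP's actual Detector; the abstract does not specify the predictor's form. It assumes, purely for illustration, that healthy iteration time can be approximated by a term linear in the token count plus a term quadratic in each sequence length, and that an iteration is suspicious when its observed time exceeds the prediction by a fixed ratio. All names (`workload_features`, `fit_cost_model`, `predict_time`, `is_fail_slow`) and the `tolerance` value are hypothetical.

```python
from typing import Sequence, Tuple


def workload_features(seq_lens: Sequence[int]) -> Tuple[float, float]:
    """Return (total tokens, sum of squared sequence lengths) for one batch."""
    linear = float(sum(seq_lens))
    quadratic = float(sum(n * n for n in seq_lens))
    return linear, quadratic


def fit_cost_model(batches: Sequence[Sequence[int]],
                   times: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t ~= a * linear + b * quadratic on healthy iterations.

    Solves the 2x2 normal equations directly, so no external library is needed.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for seq_lens, t in zip(batches, times):
        x1, x2 = workload_features(seq_lens)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * t
        r2 += x2 * t
    det = s11 * s22 - s12 * s12
    if det <= 1e-9 * s11 * s22:
        # Features are (nearly) collinear, e.g. every batch has the same shape.
        raise ValueError("need batches with varied sequence lengths to fit")
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b


def predict_time(model: Tuple[float, float], seq_lens: Sequence[int]) -> float:
    """Predicted healthy iteration time for a batch under the fitted model."""
    a, b = model
    x1, x2 = workload_features(seq_lens)
    return a * x1 + b * x2


def is_fail_slow(model: Tuple[float, float], seq_lens: Sequence[int],
                 observed: float, tolerance: float = 1.2) -> bool:
    """Flag an iteration only if it is slow relative to its own workload."""
    return observed > tolerance * predict_time(model, seq_lens)


if __name__ == "__main__":
    # Synthetic "healthy" profile; the coefficients are made up for the demo.
    batches = [[512, 512], [2048], [128, 256, 1024], [4096]]
    times = [2e-6 * sum(b) + 1e-9 * sum(n * n for n in b) for b in batches]
    model = fit_cost_model(batches, times)

    long_batch = [8192]   # slow in absolute terms, but expected to be slow
    short_batch = [512]   # small workload that runs three times too long
    print(is_fail_slow(model, long_batch, predict_time(model, long_batch)))
    print(is_fail_slow(model, short_batch, 3 * predict_time(model, short_batch)))
```

Under these assumptions, a long-sequence batch that takes exactly as long as predicted is not flagged even though its absolute time is high, while a short batch running at three times its predicted time is. A detector that compared raw iteration times against a fixed threshold could not separate the two cases.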