Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices cause performance skew across devices and reduce overall training efficiency. Existing resilient training systems overlook both sequence length variability in datasets and device performance skew under hybrid parallelism. As a result, (1) iteration time fluctuations induced by sequence length variability can trigger spurious fail-slow detections, and (2) failures are mitigated through individual adaptations within hybrid parallelism, leading to unnecessary detection overhead and inefficient resilient training. To address these problems, this paper presents ResiHP, a resilient system that enables robust failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector that accurately identifies failures. It employs a workload-aware execution time predictor that disentangles failures from iteration time fluctuations while remaining lightweight enough for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, and workload scheduling policies to improve training efficiency under failures. Experiments on a 256-GPU cluster show that ResiHP improves training throughput by 1.04$\times$ to 4.39$\times$ over state-of-the-art resilient training systems under diverse failure scenarios.