Large language model training is frequently interrupted by faults, which makes robust fault tolerance essential. Existing backup-free methods, such as redundant computation, dynamic parallelism, and data rerouting, each incur a performance penalty, whether from ongoing overhead, lengthy reconfiguration, or post-recovery inefficiency. We propose Chameleon, an adaptive fault-tolerant system that selects the optimal recovery strategy when a failure occurs. Chameleon achieves this through a unified performance model, fast execution plan search, accurate performance estimation, and efficient communication optimizations. Experiments on a 32-card cluster show that Chameleon keeps the performance gap between post-recovery and failure-free training within 11.00%, while preserving model convergence and efficient memory usage. Compared to state-of-the-art methods, Chameleon achieves up to 1.229x and 1.355x higher average throughput than Oobleck and Recycle, respectively.