As deep learning models grow larger, training takes longer and consumes more resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must keep a backup copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and incurs non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by a failure and exploits the replicas of the model state that already exist in data parallelism for recovery. When replicas are unavailable, we propose a logging-based approach that records intermediate data and, upon a failure, replays the computation to recover the lost state. The re-computation is distributed across multiple machines to further accelerate recovery. We also log intermediate data selectively, exploring the trade-off between recovery time and the storage overhead of the logged data. Evaluations show that, compared to state-of-the-art methods, SWIFT significantly reduces failure recovery time and achieves similar or better training throughput during failure-free execution, without degrading final model accuracy. SWIFT also achieves up to a 1.16x speedup in total training time over state-of-the-art methods.
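The logging-based idea in the abstract (record intermediate data, then replay the computation after a failure) can be illustrated with a toy sketch. This is not SWIFT's implementation: the `Stage`, `BoundaryLog`, and `recover` names are hypothetical, the "model" is a single scalar weight trained with plain SGD, and the log is assumed to survive the failure because it is held outside the failed worker.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy worker (hypothetical): computes y = w * x and trains w with SGD."""
    w: float = 1.0
    lr: float = 0.1

    def step(self, x: float, grad_y: float) -> None:
        # Chain rule: dL/dw = dL/dy * dy/dw = grad_y * x
        self.w -= self.lr * grad_y * x


@dataclass
class BoundaryLog:
    """Hypothetical log of the intermediate data a worker received per iteration."""
    records: list = field(default_factory=list)

    def append(self, iteration: int, x: float, grad_y: float) -> None:
        self.records.append((iteration, x, grad_y))

    def since(self, iteration: int) -> list:
        # Records needed to replay from a checkpoint taken before `iteration`.
        return [r for r in self.records if r[0] >= iteration]


def recover(checkpoint_w: float, checkpoint_iter: int,
            log: BoundaryLog, lr: float) -> Stage:
    """Rebuild the lost state by replaying logged inputs from a checkpoint."""
    stage = Stage(w=checkpoint_w, lr=lr)
    for _, x, grad_y in log.since(checkpoint_iter):
        stage.step(x, grad_y)
    return stage


if __name__ == "__main__":
    stage, log = Stage(), BoundaryLog()
    checkpoint_w, checkpoint_iter = stage.w, 0  # state before iteration 0
    for i, (x, g) in enumerate([(0.5, 1.0), (2.0, -0.5), (1.5, 0.25)]):
        log.append(i, x, g)  # log before applying the update
        stage.step(x, g)
    # Simulate losing `stage`, then replay the log from the checkpoint.
    restored = recover(checkpoint_w, checkpoint_iter, log, lr=0.1)
    print(restored.w == stage.w)
```

Because the replay applies the same updates in the same order, the restored weight matches the lost one exactly in this toy setting. Logging fewer records would shrink the log but force replay to start from an earlier point, which is the kind of recovery-time versus storage trade-off the abstract mentions.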