SWIFT: Expedited Failure Recovery for Large-scale DNN Training

As deep learning models grow larger, training takes more time and resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and leads to non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by the failure and exploits the replicas of the model state that already exist under data parallelism. For cases where replicas are unavailable, we propose a logging-based approach that records intermediate data and replays the computation to recover the lost state after a failure. The re-computation is distributed across multiple machines to further accelerate recovery. We also log intermediate data selectively, exploring the trade-off between recovery time and the storage overhead of the logged data. Evaluations show that SWIFT significantly reduces failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods, without degrading final model accuracy. SWIFT also achieves up to a 1.16x speedup in total training time compared to state-of-the-art methods.
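The logging-based recovery described in the abstract can be illustrated with a toy sketch. This is not SWIFT's implementation: the `Stage`, `ReplayLog`, and `recover` names are hypothetical, the model is a single scalar weight, and the sketch assumes the logged "intermediate data" are the inputs and upstream gradients a worker received since its last saved state. It shows only the idea that replaying logged data in order reproduces the lost state.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy worker holding one scalar weight (hypothetical stand-in for model state)."""
    weight: float = 1.0
    lr: float = 0.1

    def step(self, x: float, grad_out: float) -> None:
        # For y = weight * x, dL/dweight = grad_out * x; apply a plain SGD update.
        self.weight -= self.lr * grad_out * x


@dataclass
class ReplayLog:
    """Last saved weight plus the intermediate data received since then."""
    checkpoint_weight: float
    records: list = field(default_factory=list)

    def record(self, x: float, grad_out: float) -> None:
        self.records.append((x, grad_out))


def recover(log: ReplayLog, lr: float) -> Stage:
    """Rebuild a failed stage by replaying the logged data in the original order."""
    stage = Stage(weight=log.checkpoint_weight, lr=lr)
    for x, grad_out in log.records:
        stage.step(x, grad_out)
    return stage


def demo():
    live = Stage()
    log = ReplayLog(checkpoint_weight=live.weight)
    for x, grad_out in [(1.0, 0.5), (2.0, -0.25), (0.5, 1.0)]:
        log.record(x, grad_out)   # log before applying, so nothing is lost on failure
        live.step(x, grad_out)
    recovered = recover(log, lr=live.lr)  # pretend `live` has just failed
    return live.weight, recovered.weight
```

Calling `demo()` returns the live and recovered weights, which match because the replay applies the same updates in the same order. Logging fewer records would shrink the log but require recomputing more on recovery, which is the trade-off the abstract refers to.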

