SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or weeks at a time. At these scales, failures are frequent and can substantially degrade training throughput. Restoring performance using spare GPU servers becomes increasingly expensive as models grow. SlipStream is a system for efficient DNN training in the presence of failures, without using spare servers. It exploits the functional redundancy inherent in distributed training systems -- servers hold the same model parameters across data-parallel groups -- as well as the bubbles in the pipeline schedule within each data-parallel group. SlipStream dynamically re-routes the work of a failed server to its data-parallel peers, ensuring continuous training despite multiple failures. However, re-routing work leads to imbalances across pipeline stages that degrade training throughput. SlipStream introduces two optimizations that allow re-routed work to execute within bubbles of the original pipeline schedule. First, it decouples the backward pass computation into two phases. Second, it staggers the execution of the optimizer step across pipeline stages. Combined, these optimizations enable schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype of SlipStream and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to 1.46x and 1.64x, respectively.
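
The re-routing idea can be illustrated with a minimal sketch. It is not SlipStream's scheduler: the function name `reroute_microbatches`, the `(pipeline, stage)` worker identifiers, and the round-robin assignment policy are all assumptions made for illustration. The sketch relies only on the property stated above, namely that workers at the same pipeline stage in different data-parallel groups hold the same parameters, so any live peer can process a failed worker's micro-batches.

```python
from collections import defaultdict


def reroute_microbatches(num_pipelines, num_stages, microbatches_per_pipeline, failed):
    """Assign every micro-batch at every stage to a live worker.

    Workers are identified by (pipeline, stage); `failed` is a set of such
    pairs. Returns a dict mapping each live worker to the list of
    (source_pipeline, microbatch_index) items it must process.
    Hypothetical policy: orphaned micro-batches are spread round-robin
    over the live data-parallel peers of the same stage.
    """
    load = defaultdict(list)
    for stage in range(num_stages):
        live = [p for p in range(num_pipelines) if (p, stage) not in failed]
        if not live:
            # No peer holds this stage's parameters any more.
            raise RuntimeError(f"stage {stage} has no live data-parallel peer")

        # Healthy workers keep their own micro-batches.
        for p in live:
            for mb in range(microbatches_per_pipeline):
                load[(p, stage)].append((p, mb))

        # Micro-batches of failed workers are re-routed to live peers.
        orphans = [
            (p, mb)
            for p in range(num_pipelines)
            if (p, stage) in failed
            for mb in range(microbatches_per_pipeline)
        ]
        for i, item in enumerate(orphans):
            target = live[i % len(live)]
            load[(target, stage)].append(item)
    return dict(load)


if __name__ == "__main__":
    # Three data-parallel pipelines, two stages, four micro-batches each;
    # the stage-0 worker of pipeline 1 has failed.
    plan = reroute_microbatches(3, 2, 4, failed={(1, 0)})
    for worker in sorted(plan):
        print(worker, len(plan[worker]))
```

In this example the two surviving stage-0 workers each end up with six micro-batches while every stage-1 worker still has four. That per-stage imbalance is the throughput problem the abstract describes, and it is what the two optimizations aim to absorb into pipeline bubbles.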
