Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a significant impact on training throughput. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters -- and the bubbles in the pipeline schedule within each data-parallel group. When servers fail, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing training to continue uninterrupted despite multiple failures. However, this re-routing can create imbalances across pipeline stages, reducing training throughput. To address this, ReCycle introduces two key optimizations that ensure re-routed micro-batches are processed within the original pipeline schedule's bubbles. First, it decouples the backward pass into two phases: one that computes gradients with respect to the inputs and another that computes gradients with respect to the parameters. Second, it avoids synchronization across pipeline stages by staggering the optimizer step. Together, these optimizations enable adaptive pipeline schedules that minimize or even eliminate training throughput degradation during failures. We describe a prototype of ReCycle and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to $1.46\times$ and $1.64\times$, respectively.
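To make the first optimization concrete, the sketch below separates the two backward phases for a single linear layer: the input gradient is computed and sent upstream immediately, while the weight gradient is deferred so it can be scheduled into a pipeline bubble. This is a minimal PyTorch sketch under assumed shapes; the function names `backward_input` and `backward_weight` are illustrative, not ReCycle's actual API.

```python
import torch

# Decoupled backward pass for a linear layer y = x @ W.T (a minimal sketch;
# function names and shapes are illustrative assumptions, not ReCycle's code).

def backward_input(grad_output, weight):
    # Phase 1: gradient w.r.t. the input. This is on the critical path --
    # the previous pipeline stage is stalled until it arrives.
    return grad_output @ weight

def backward_weight(grad_output, saved_input):
    # Phase 2: gradient w.r.t. the parameters. Nothing downstream waits on
    # it, so it can be deferred into a bubble in the pipeline schedule.
    return grad_output.t() @ saved_input

x = torch.randn(4, 8)       # activation saved during the forward pass
W = torch.randn(16, 8)      # parameters; y = x @ W.T has shape (4, 16)
g = torch.randn(4, 16)      # gradient arriving from the next pipeline stage

dx = backward_input(g, W)   # shape (4, 8); sent upstream immediately
dW = backward_weight(g, x)  # shape (16, 8); computed later, inside a bubble
```

Deferring the weight gradient is safe because only the input gradient sits on the inter-stage critical path; the parameter gradient is needed no earlier than the (staggered) optimizer step.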