Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach: in the planning phase, it generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on the model states already replicated across those replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can cover all available resources after $f$ or fewer simultaneous failures, so no resources are left idle. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput and outperforms state-of-the-art fault tolerance solutions such as Bamboo and Varuna by up to $29.6\times$.
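To make the coverage guarantee concrete, the following is a minimal sketch and not Oobleck's actual planner. It assumes each pipeline template is characterized only by the number of nodes it occupies, and it uses a simple dynamic program to look for a combination of templates that uses every available node while keeping at least a given number of pipeline replicas. The function name `cover_nodes`, the template sizes, and the cluster size are all hypothetical.

```python
from __future__ import annotations

from typing import Optional


def cover_nodes(
    template_sizes: list[int], num_nodes: int, min_pipelines: int
) -> Optional[list[int]]:
    """Pick template sizes (repetition allowed) that sum to exactly num_nodes.

    Returns the combination with the most pipelines, or None if no combination
    uses every node or if fewer than min_pipelines pipelines would result.
    """
    # best[n] holds the longest list of template sizes summing to exactly n.
    best: list[Optional[list[int]]] = [None] * (num_nodes + 1)
    best[0] = []
    for n in range(1, num_nodes + 1):
        for size in template_sizes:
            prev = best[n - size] if size <= n else None
            if prev is None:
                continue
            current = best[n]
            if current is None or len(prev) + 1 > len(current):
                best[n] = prev + [size]
    plan = best[num_nodes]
    if plan is None or len(plan) < min_pipelines:
        return None
    return plan


if __name__ == "__main__":
    f = 2                    # number of simultaneous failures to tolerate (assumed)
    sizes = [2, 3, 4]        # hypothetical node counts of the pipeline templates
    total_nodes = 13         # hypothetical cluster size

    # Initially, at least f + 1 replicas must be instantiated.
    print("initial plan:", cover_nodes(sizes, total_nodes, f + 1))

    # After up to f node failures, some combination should still use all nodes.
    for failed in range(1, f + 1):
        remaining = total_nodes - failed
        print(f"{remaining} nodes left:", cover_nodes(sizes, remaining, 1))
```

In this toy setting, a set of templates with consecutive node counts can absorb the loss of individual nodes by swapping a larger template for a smaller one. A single template size cannot do so whenever the remaining node count is not a multiple of that size, which is one way to see why heterogeneous templates help avoid idle resources.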