Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach: during planning, it generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on the model states already replicated across these replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput and outperforms state-of-the-art fault tolerance solutions such as Bamboo and Varuna by up to $13.9\times$.
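The coverage guarantee can be illustrated with a small sketch. It is a simplified model, not Oobleck's actual planning algorithm: each pipeline template is assumed to be characterized only by the number of nodes it uses, and "covering all available resources" is assumed to mean that the template sizes of the instantiated pipelines sum exactly to the number of surviving nodes. The helper names `find_cover` and `tolerates` are hypothetical.

```python
from functools import lru_cache


def find_cover(template_sizes, nodes, min_pipelines=1):
    """Return a sorted list of template sizes that sums exactly to `nodes`
    and contains at least `min_pipelines` pipelines, or None if impossible.

    Assumption (not from the paper): a template is fully described by its
    node count, and the same template may be instantiated many times.
    """
    sizes = sorted(set(template_sizes))

    @lru_cache(maxsize=None)
    def best(n):
        # Longest tuple of template sizes summing to n, or None.
        if n == 0:
            return ()
        result = None
        for s in sizes:
            if s <= n:
                sub = best(n - s)
                if sub is not None and (result is None or len(sub) + 1 > len(result)):
                    result = (s,) + sub
        return result

    cover = best(nodes)
    # If even the longest cover has too few pipelines, no valid cover exists.
    if cover is None or len(cover) < min_pipelines:
        return None
    return sorted(cover)


def tolerates(template_sizes, total_nodes, f):
    """Check the simplified guarantee: at least f+1 pipelines initially,
    and no idle node after any k <= f node failures."""
    if find_cover(template_sizes, total_nodes, min_pipelines=f + 1) is None:
        return False
    return all(
        find_cover(template_sizes, total_nodes - k) is not None
        for k in range(1, f + 1)
    )


if __name__ == "__main__":
    # Heterogeneous templates using 2, 3, or 4 nodes on a 13-node cluster.
    print(find_cover([2, 3, 4], 13, min_pipelines=3))
    print(tolerates([2, 3, 4], 13, f=2))
    # A single 4-node template leaves nodes idle once one node fails.
    print(tolerates([4], 12, f=1))
```

In this toy model, a homogeneous set of templates fails the check as soon as the surviving node count is no longer a multiple of the template size. This is the intuition behind generating heterogeneous templates up front.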