The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. These strategies enable various (re-)configurable execution plans for a training job, which exhibit remarkably different requirements for multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then allocate resources without awareness of the chosen plans and their resource requirements. This approach results in mismatches between execution plans and resources, leaving both training performance and cluster utilization far from optimal. We introduce Rubick, a cluster scheduling system for deep learning training that exploits this reconfigurability to improve job performance and cluster efficiency. Rubick incorporates job execution planning as a new dimension in cluster scheduling by continuously reconfiguring jobs' execution plans and jointly tuning multi-resource allocations across jobs. This co-optimization is guided by a performance model that captures the diverse resource requirements and performance characteristics of different jobs and execution plans. Rubick uses this model to make performance-aware scheduling decisions that maximize cluster throughput while providing performance guarantees to individual jobs. Evaluations on a 64-GPU high-performance training cluster show that Rubick improves average job completion time and makespan by up to 3.2x and 1.4x, respectively, compared with state-of-the-art systems.
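To make the idea of choosing execution plans and resource allocations jointly more concrete, the following is a minimal toy sketch, not Rubick's actual algorithm. It assumes a hypothetical table of candidate plans per job, where each entry carries a GPU requirement and a throughput figure standing in for a performance model's prediction. The job names, plan names, numbers, and the function `best_joint_assignment` are all invented for illustration, and summing throughput across jobs is a simplification.

```python
from itertools import product

# Hypothetical candidate plans per job: (plan name, GPUs required, predicted throughput).
# The throughput values are made-up stand-ins for a performance model's predictions.
CANDIDATES = {
    "job_a": [("dp", 4, 100.0), ("zero_offload", 2, 70.0), ("3d_parallel", 8, 160.0)],
    "job_b": [("dp", 4, 90.0), ("zero_offload", 1, 40.0)],
}


def best_joint_assignment(candidates, gpu_budget):
    """Pick one plan per job, by exhaustive search, to maximize the sum of
    predicted throughput without exceeding the GPU budget.

    Returns (mapping of job -> plan name, total predicted throughput),
    or (None, -inf) if no combination fits the budget.
    """
    jobs = sorted(candidates)
    best, best_score = None, float("-inf")
    # Enumerate every combination of one plan per job.
    for combo in product(*(candidates[job] for job in jobs)):
        gpus_needed = sum(plan[1] for plan in combo)
        if gpus_needed > gpu_budget:
            continue  # this combination does not fit on the cluster
        score = sum(plan[2] for plan in combo)
        if score > best_score:
            best = dict(zip(jobs, (plan[0] for plan in combo)))
            best_score = score
    return best, best_score


if __name__ == "__main__":
    for budget in (8, 9):
        print(budget, best_joint_assignment(CANDIDATES, budget))
```

In this toy setting, changing the GPU budget from 8 to 9 changes the chosen plan for both jobs, which illustrates why fixing plans statically before allocation can leave performance on the table. Exhaustive search is used only to keep the sketch short; it does not scale to realistic numbers of jobs and plans.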