We propose a novel GPU-cluster scheduler for distributed deep learning (DDL) workloads that enables proximity-based consolidation of GPU resources according to each DDL job's sensitivity to anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism that optimizes the delay timers for effective delay scheduling. Additionally, to enable cost-effective large-scale experiments, we develop a data-driven DDL cluster simulation platform. Using this simulation platform, we compare our scheduler against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Compared to the prevailing consolidation-based scheduling methods, our scheduler improves the end-to-end makespan for training all jobs by up to 69%, while reducing the average job completion time by up to 83% and the communication overhead by up to 98% under congested networking conditions.
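To make component (i) concrete, the following is a minimal sketch of how delay scheduling can drive consolidation; it is not the scheduler's actual implementation. It assumes three locality tiers (single machine, single rack, anywhere) and fixed per-job timers; the names `Job`, `allowed_tier`, `try_place`, `machine_timer`, and `rack_timer` are hypothetical, and the timer values would in practice come from a mechanism such as the auto-tuner.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical cluster view: free GPU count keyed by (rack, machine).
Node = Tuple[str, str]
FreeGpus = Dict[Node, int]
Placement = List[Tuple[Node, int]]


@dataclass
class Job:
    name: str
    gpus: int             # number of GPUs requested
    arrival: float        # time the job entered the queue
    machine_timer: float  # assumed wait budget for a single-machine placement
    rack_timer: float     # assumed extra wait budget for a single-rack placement


def allowed_tier(job: Job, now: float) -> int:
    """Loosest locality tier the job may accept at time `now`.

    0 = single machine, 1 = single rack, 2 = anywhere in the cluster.
    """
    waited = now - job.arrival
    if waited < job.machine_timer:
        return 0
    if waited < job.machine_timer + job.rack_timer:
        return 1
    return 2


def _take(nodes: List[Tuple[Node, int]], need: int) -> Placement:
    """Greedily take GPUs from the fullest nodes first to limit spread."""
    picked: Placement = []
    for node, free in sorted(nodes, key=lambda item: -item[1]):
        if need == 0:
            break
        k = min(free, need)
        if k > 0:
            picked.append((node, k))
            need -= k
    return picked


def try_place(job: Job, free: FreeGpus, now: float) -> Optional[Placement]:
    """Return a placement, or None if the job should keep waiting."""
    tier = allowed_tier(job, now)

    # Tier 0: the whole job fits on one machine (always acceptable).
    for node, n in sorted(free.items()):
        if n >= job.gpus:
            return [(node, job.gpus)]
    if tier == 0:
        return None  # keep waiting for a tighter placement

    # Tier 1: the whole job fits inside one rack.
    racks: Dict[str, List[Tuple[Node, int]]] = {}
    for node, n in free.items():
        racks.setdefault(node[0], []).append((node, n))
    for rack in sorted(racks):
        if sum(n for _, n in racks[rack]) >= job.gpus:
            return _take(racks[rack], job.gpus)
    if tier == 1:
        return None

    # Tier 2: both timers expired, so accept any GPUs in the cluster.
    if sum(free.values()) >= job.gpus:
        return _take(list(free.items()), job.gpus)
    return None
```

In this sketch, a job that cannot be consolidated is skipped rather than spread across the network right away; its acceptable tier loosens only as its timers expire, which is why the choice of timer values matters for both queueing delay and communication overhead.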