Training deep neural networks (DNNs) is a major workload in datacenters today, and its energy consumption is growing rapidly. It is therefore important to reduce energy consumption while still completing deep learning (DL) training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under an energy budget. We first present performance models for DL training jobs that predict throughput and energy consumption under different configurations. Based on these models, PowerFlow dynamically allocates GPUs and adjusts the GPU-level or job-level configurations of DL training jobs. PowerFlow applies network packing and buddy allocation to job placement, thus avoiding the extra energy consumed by cluster fragmentation. Evaluation results show that, under the same energy consumption, PowerFlow improves the average JCT by up to 1.57-3.39x compared to competitive baselines.
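To make the buddy-allocation idea mentioned above concrete, the following is a minimal sketch of a generic buddy allocator over GPU indices. It is not PowerFlow's implementation: the class name `BuddyGpuAllocator`, its methods, and the assumptions that the cluster size and every request are powers of two are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class BuddyGpuAllocator:
    """Toy buddy allocator over GPU indices 0..total-1 (hypothetical sketch)."""

    def __init__(self, total: int):
        if total <= 0 or total & (total - 1):
            raise ValueError("total must be a power of two")
        self.total = total
        # free[size] holds the start offsets of free blocks of that size.
        self.free = defaultdict(set)
        self.free[total].add(0)

    def allocate(self, n: int):
        """Return the start index of a block of n GPUs, or None if none fits."""
        if n <= 0 or n & (n - 1) or n > self.total:
            raise ValueError("n must be a power of two no larger than total")
        size = n
        # Find the smallest free block that can hold the request.
        while size <= self.total and not self.free[size]:
            size *= 2
        if size > self.total:
            return None
        start = min(self.free[size])
        self.free[size].remove(start)
        # Split the block in half repeatedly, keeping each upper half free.
        while size > n:
            size //= 2
            self.free[size].add(start + size)
        return start

    def release(self, start: int, n: int) -> None:
        """Free a block and merge it with its buddy while the buddy is free."""
        size = n
        while size < self.total:
            buddy = start ^ size  # the buddy differs only in the 'size' bit
            if buddy not in self.free[size]:
                break
            self.free[size].remove(buddy)
            start = min(start, buddy)
            size *= 2
        self.free[size].add(start)


alloc = BuddyGpuAllocator(8)
a = alloc.allocate(2)   # block starting at GPU 0
b = alloc.allocate(4)   # block starting at GPU 4
c = alloc.allocate(2)   # block starting at GPU 2
d = alloc.allocate(2)   # None: the cluster is full
```

Because freed blocks merge back with their buddies, releasing all three jobs in this sketch restores a single contiguous 8-GPU block, which illustrates how this style of allocation limits fragmentation.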