Deep Learning Training (DLT) is a growing workload in shared GPU/CPU clusters due to its high computational cost and the increasing number of jobs. This contributes to significant energy consumption in GPU clusters, further exacerbated by GPU under-utilization, as shown in production cluster logs. Addressing this challenge requires workload scheduling and resource allocation policies that enable efficient GPU sharing, improving resource and energy efficiency while maintaining performance. However, prior work primarily optimizes for performance, often overlooking or even sacrificing energy efficiency. In this paper, we present EaCO, the first energy-aware scheduling algorithm designed specifically for DLT workloads in GPU clusters. EaCO leverages hardware-supported context switching to enable GPU sharing across multiple DLT jobs, improving resource and energy utilization. GPU sharing can increase Job Completion Time (JCT) and, if not employed carefully, may lead to contention. To address this, EaCO integrates experiment-based and history-based predictions as well as early-stage observations, ensuring that performance expectations are met while optimizing energy efficiency. We begin by experimentally exploring the dynamics of co-locating DLT jobs, investigating the impact of co-location on energy and resource utilization. Our results show that co-location improves energy efficiency by up to 44% for individual jobs and increases average GPU utilization to as high as 97%. Additionally, evaluations on large-scale clusters using production traces demonstrate that EaCO reduces total energy consumption by up to 39% compared to existing algorithms, with a minimal increase in job runtime of less than 3.2% in our simulations.