The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, in which tasks must be scheduled without knowledge of future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial given the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy that minimizes power usage by selecting power-efficient GPU and CPU combinations. PWR relies on a simplified power-consumption model integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show that PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.
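To make the scoring idea concrete, the following is a minimal, hypothetical sketch of how a power-aware score function might rank candidate nodes, in the spirit of a Kubernetes score plugin. The linear power model, node fields, and function names here are illustrative assumptions for exposition, not the paper's actual PWR implementation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Illustrative node description; fields are assumptions, not PWR's schema."""
    name: str
    gpu_idle_w: float      # idle power draw per GPU (watts)
    gpu_peak_w: float      # peak power draw per GPU (watts)
    cpu_w_per_core: float  # marginal CPU power per allocated core (watts)

def estimated_power(node: Node, gpu_fraction: float, cpu_cores: int) -> float:
    """Simplified linear model: idle GPU power plus a utilization-proportional
    GPU term, plus a per-core CPU term (an assumed stand-in for the paper's
    simplified power model)."""
    gpu_power = node.gpu_idle_w + gpu_fraction * (node.gpu_peak_w - node.gpu_idle_w)
    return gpu_power + cpu_cores * node.cpu_w_per_core

def score(nodes: list[Node], gpu_fraction: float, cpu_cores: int) -> dict[str, float]:
    """Map each feasible node to a 0-100 score, lower estimated power giving a
    higher score, mirroring how a Kubernetes score plugin ranks nodes."""
    powers = {n.name: estimated_power(n, gpu_fraction, cpu_cores) for n in nodes}
    lo, hi = min(powers.values()), max(powers.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all nodes tie
    return {name: 100.0 * (hi - p) / span for name, p in powers.items()}
```

Under this sketch, the scheduler would place a task on the highest-scoring node, steering load toward power-efficient GPU/CPU combinations; in practice such a score would be combined with a fragmentation-aware score such as FGD's.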