The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique for reducing energy consumption, and thus improving energy efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of Large Language Model (LLM) training and inference through DVFS or power capping is feasible: related work has shown that energy savings can be substantial, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations, i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level DVFS approach that explores new frequency configurations, and show that these save more energy than previous pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach reduces energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show that the clock frequencies we discover transfer well to both. We conclude that kernel-level DVFS is a suitable technique for reducing waste in LLM operations, providing significant energy savings with negligible slowdown.
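To illustrate the mechanism behind kernel-level DVFS (this is a minimal sketch, not the paper's implementation), the snippet below uses the NVML locked-clock API via pynvml to lower the GPU core clock around an individual kernel and restore the default afterwards. The 1200 MHz value, the choice of a memory-bound softmax versus a compute-bound matmul, and the helper name `run_with_clock` are hypothetical; setting locked clocks typically requires administrative privileges.

```python
# Sketch: per-kernel GPU frequency control via NVML locked clocks.
# Assumptions: pynvml installed, CUDA GPU at index 0, sufficient privileges.
import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def run_with_clock(freq_mhz, kernel_fn, *args):
    """Lock the GPU core clock to freq_mhz, run one kernel, then reset clocks."""
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, freq_mhz, freq_mhz)
    torch.cuda.synchronize()          # ensure the clock change takes effect first
    out = kernel_fn(*args)
    torch.cuda.synchronize()          # wait for the kernel before restoring clocks
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    return out

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Hypothetical policy: run a memory-bound kernel at a reduced clock (1200 MHz),
# leave a compute-bound kernel at the default maximum clock.
softmax_out = run_with_clock(1200, torch.softmax, a, -1)
matmul_out = torch.matmul(a, b)

pynvml.nvmlShutdown()
```

In practice, a kernel-level policy like the one studied in the paper would assign a frequency per kernel based on whether it is compute- or memory-bound, rather than hard-coding values as in this illustration.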