The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive, contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus only on a single aspect of energy consumption: dynamic or static energy. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time--energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time--energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.