The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive and contended resource that requires explicit management and optimization. Although recent work has made significant progress in optimizing large model training, it focuses on either dynamic or static energy consumption, not both. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently affect both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems and then uses a multi-pass multi-objective optimization algorithm to find execution schedules that advance this frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.
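To make the notion of a time-energy tradeoff frontier concrete, the following is a minimal sketch and not Kareus's algorithm. It assumes a toy model in which dynamic power grows with the cube of the clock frequency (a common textbook approximation) and static power is constant, so static energy grows with running time. The function names and the constants `work`, `static_w`, and `k` are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time in seconds, energy in joules)


def toy_iteration(freq_ghz: float, work: float = 10.0,
                  static_w: float = 100.0, k: float = 50.0) -> Point:
    """Toy model (an assumption, not a measurement) of one training iteration.

    Time shrinks as frequency rises; dynamic power is modeled as k * f^3;
    static power is drawn for as long as the iteration runs.
    """
    time_s = work / freq_ghz
    dynamic_w = k * freq_ghz ** 3
    energy_j = (dynamic_w + static_w) * time_s
    return time_s, energy_j


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only points that no other point beats on both time and energy."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # Sorting by (time, energy) means every later point is no faster, so it
    # belongs on the frontier only if it uses strictly less energy.
    for time_s, energy_j in sorted(set(points)):
        if energy_j < best_energy:
            frontier.append((time_s, energy_j))
            best_energy = energy_j
    return frontier


# Hypothetical frequency settings to compare.
freqs = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
candidates = [toy_iteration(f) for f in freqs]
frontier = pareto_frontier(candidates)

for time_s, energy_j in frontier:
    print(f"time={time_s:.2f}s energy={energy_j:.1f}J")
```

Under these assumed constants, the lowest frequency (0.8) falls off the frontier: it is slower than the 1.0 setting and also uses more energy, because the extra static energy from the longer runtime outweighs the dynamic savings. This is the kind of interaction between dynamic and static energy that the abstract describes, shown here for frequency scaling only.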