As large-scale HPC clusters adopt accelerators such as GPUs to meet the voracious demands of modern workloads, they are increasingly becoming power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers have overcome this problem by profiling applications on specific clusters and optimizing them, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost power and performance profiling. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving predictions over state-of-the-art approaches by 10%.