As large-scale HPC clusters increasingly adopt accelerators such as GPUs to meet the voracious demands of modern workloads, they are becoming power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers overcome this problem by profiling applications on specific clusters and optimizing accordingly, but scale, algorithmic diversity, and a lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that uses low-cost power and performance profiling to identify applications with similar characteristics. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving on the predictions of state-of-the-art approaches by 10%.