In the exascale era, application behavior has a large power and energy footprint, and job-level awareness of that footprint is crucial to achieving efficiency goals beyond performance, such as energy efficiency and sustainability. To this end, we have developed a novel low-latency machine learning pipeline for job power profiling that groups job-level power profiles by shape as jobs complete. The pipeline combines comprehensive feature extraction with clustering, powered by a generative adversarial network (GAN) model, to handle the feature-rich time series of job-level power measurements. The clustering output is then used to train a classification model that predicts whether an incoming job power profile is similar to a known group of profiles or is completely new. Through extensive evaluations, we demonstrate the effectiveness of each component of the pipeline. We also provide a preliminary analysis of the resulting clusters, which depict the power profile landscape of the Summit supercomputer based on more than 60K jobs sampled from the year 2021.
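To make the "similar to a known group or completely new" decision concrete, the following is a minimal sketch of shape-based grouping with an open-set check. It is not the pipeline described above: the GAN-based feature extraction is replaced here by a simple stand-in (resampling plus z-score normalization), the clustering step is replaced by hand-labeled example groups, and the classifier is a nearest-centroid rule with a distance threshold. All function names, the feature length `n`, and the threshold value are hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def resample(profile, n=16):
    """Linearly resample a power time series to n points so that
    jobs of different durations become comparable."""
    if len(profile) == 1:
        return [float(profile[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(profile) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out


def shape_features(profile, n=16):
    """Stand-in feature extractor (assumption, not the GAN model):
    z-score the resampled series so only its shape matters."""
    r = resample(profile, n)
    mu, sigma = mean(r), pstdev(r)
    if sigma == 0:
        return [0.0] * n  # a constant profile has no shape variation
    return [(x - mu) / sigma for x in r]


def centroid(vectors):
    """Component-wise mean of equally sized feature vectors."""
    return [mean(col) for col in zip(*vectors)]


def classify(profile, centroids, threshold):
    """Return the label of the nearest known group, or "new" when the
    profile is farther than `threshold` from every group centroid."""
    feats = shape_features(profile)
    best_label, best_dist = None, float("inf")
    for label, c in centroids.items():
        d = math.dist(feats, c)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else "new"


# Hypothetical known groups: constant-power jobs and ramp-up jobs.
known = {
    "flat": [[100] * 20, [200] * 30],
    "ramp": [list(range(10)), list(range(0, 40, 2))],
}
centroids = {
    label: centroid([shape_features(p) for p in profiles])
    for label, profiles in known.items()
}

print(classify([50] * 10, centroids, threshold=1.0))
print(classify(list(range(100)), centroids, threshold=1.0))
print(classify(list(range(100, 0, -1)), centroids, threshold=1.0))
```

In this sketch a constant profile and a rising ramp fall into their respective known groups regardless of absolute power level or job length, while a falling ramp lies far from both centroids and is reported as new. A real deployment would need learned features and a threshold calibrated on held-out data.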