High-performance computing (HPC) and supercomputing are critical to Artificial Intelligence (AI) research, development, and deployment. The extensive use of supercomputers to train complex AI models, which can take from days to months, raises significant concerns about energy consumption and carbon emissions. Traditional methods for estimating the energy consumption of HPC workloads rely on metering reports from the compute nodes' power supply units (PSUs) and assume exclusive use of the entire node. This assumption is increasingly untenable as next-generation supercomputers share resources to accelerate workloads, as seen in initiatives like Acceleration as a Service (XaaS) and cloud computing. This paper introduces EfiMon, a platform-agnostic and non-invasive tool designed to extract detailed information about process execution, including the instructions executed within specific time windows and CPU and RAM usage. It also captures system-wide metrics, such as the power consumption reported by CPU sockets and PSUs. This data enables the development of prediction models that estimate the energy consumption of individual processes without requiring isolation. Using a regression-based mathematical model, our tool estimates the power consumption of single processes in both isolated and shared-resource environments. In shared scenarios, the model's estimates deviate from the non-shared cases by at most 2.2% on Intel-based machines and 4.4% on AMD systems. This accuracy demonstrates EfiMon's potential for improving energy accounting in supercomputing, contributing to more efficient and energy-aware optimisation strategies in HPC.
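To make the idea of regression-based power attribution concrete, the following is a minimal sketch, not EfiMon's actual model. It assumes a single feature (total CPU utilisation) and a linear relation to socket power, fits that relation by ordinary least squares, and then attributes the utilisation-dependent part of the power to one process according to its own utilisation. The function names `fit_line` and `estimate_process_power`, the choice of feature, and all sample values are hypothetical and synthetic.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def estimate_process_power(slope, process_util):
    """Attribute only the utilisation-dependent power to the process.

    Assumption: idle (intercept) power is not charged to any process.
    """
    return slope * process_util


# Synthetic samples, made up for illustration only:
# total CPU utilisation in [0, 1] and socket power in watts.
total_util = [0.0, 0.25, 0.5, 0.75, 1.0]
socket_power = [40.0, 65.0, 90.0, 115.0, 140.0]

idle_power, watts_per_util = fit_line(total_util, socket_power)

# Hypothetical process using 30% of the socket while others share it.
process_power = estimate_process_power(watts_per_util, 0.3)
print(f"idle: {idle_power:.1f} W, process: {process_power:.1f} W")
```

A model of the kind the abstract describes would presumably use more features (for example, instruction counts per time window and RAM usage) and be validated against measured socket and PSU power; the single-feature fit here only illustrates the attribution step.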