Sustainability in high-performance computing (HPC) is a major challenge not only for HPC centers and their users, but also for society as climate goals become stricter. Considerable effort has gone into reducing the energy consumption of computing systems in general. Although some efforts to optimize the energy efficiency of HPC workloads exist, most propose solutions that target CPUs. As HPC systems shift increasingly to GPU-centric architectures, simulation codes are adopting GPU programming models, which creates an urgent need to improve the energy efficiency of GPU-enabled codes. However, reducing the energy consumption of large-scale simulations that execute on both CPUs and GPUs has received insufficient attention. In this work, we enable accurate power and energy measurements using an open-source toolkit across a range of CPU+GPU node architectures. We apply this approach to SPH-EXA, an open-source GPU-centric astrophysical and cosmological simulation framework. We show that with simple code instrumentation, users can accurately measure power- and energy-related data about their application, beyond the data provided by HPC systems alone. These accurate power and energy data give users significant insight for conducting energy-aware computational experiments and for future energy-aware code development.
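As a rough illustration of what region-level energy accounting involves, the sketch below integrates sampled power over time with the trapezoidal rule. It is not the toolkit's API or the SPH-EXA instrumentation: the function name `energy_joules` and the CPU and GPU sample values are hypothetical, chosen only to show the arithmetic.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical power samples for one instrumented code region
t = [0.0, 1.0, 2.0, 3.0]              # seconds
gpu_w = [200.0, 300.0, 300.0, 200.0]  # assumed GPU power readings in watts
cpu_w = [100.0, 100.0, 100.0, 100.0]  # assumed CPU power readings in watts

# Energy of the region is the sum over the measured devices
region_energy = energy_joules(t, gpu_w) + energy_joules(t, cpu_w)
print(f"region energy: {region_energy:.1f} J")
```

In practice the samples would come from hardware counters or vendor interfaces on each node. The integration step is the same regardless of the source, but the result is only as accurate as the sampling rate and the sensors allow.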