Modern exascale GPU- and APU-based systems provide multiple power and energy sensors, but differences in scope, update rate, timing, and filtering complicate the attribution of short-lived accelerator activity. This paper presents a methodology to characterize and correct these effects on Cray EX systems with AMD Instinct MI250X GPUs (Frontier) and MI300A APUs (Portage). Using controlled square-wave workloads, we quantify update intervals, delay, aliasing, and variability across up to 512 GPUs and 480 APUs for both on-chip (rocm-smi/amd-smi) and off-chip Cray Power Management sensors. We reconstruct power from cumulative energy counters to achieve faster response times, validate the reconstructed power against on-chip, off-chip, and node-level sensors, and integrate the resulting streams into a Score-P/PAPI-based tool for time-aligned, phase-level attribution. Applied to rocHPL, rocHPL-MxP, and HPG-MxP, the method separates energy savings due to reduced runtime from those due to changes in power. Mixed precision reduces node energy on Frontier by 79% for rocHPL-MxP and 31% for HPG-MxP, with similar trends on Portage. These results provide portable guidance for sensor validation and power-aware optimization on current and future exascale systems.
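Two ideas from the abstract can be illustrated with a minimal sketch: reconstructing power by differencing a cumulative energy counter, and splitting an energy change into a runtime factor and a mean-power factor. This is not the authors' implementation. The function names, the optional `wrap_j` wrap-around handling, and all sample values are hypothetical, and the counter readings are assumed to be already converted to joules with timestamps in seconds. The split relies only on the identity that energy equals mean power times runtime.

```python
def power_from_energy(timestamps_s, energy_j, wrap_j=None):
    """Return (timestamp, watts) pairs from a cumulative energy counter.

    Each value is the mean power over the interval ending at that timestamp,
    computed as delta energy / delta time. wrap_j is a hypothetical counter
    range used to undo a single wrap-around between two reads.
    """
    if len(timestamps_s) != len(energy_j):
        raise ValueError("timestamps and energy readings must have equal length")
    samples = []
    for i in range(1, len(energy_j)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        de = energy_j[i] - energy_j[i - 1]
        if de < 0 and wrap_j is not None:
            de += wrap_j  # assume the counter wrapped exactly once
        if dt <= 0:
            continue  # skip duplicate or out-of-order reads
        samples.append((timestamps_s[i], de / dt))
    return samples


def split_energy_change(t_base_s, e_base_j, t_new_s, e_new_j):
    """Factor an energy ratio into runtime and mean-power ratios.

    Since E = P_mean * T, energy_ratio == runtime_ratio * power_ratio.
    """
    p_base = e_base_j / t_base_s
    p_new = e_new_j / t_new_s
    return {
        "energy_ratio": e_new_j / e_base_j,
        "runtime_ratio": t_new_s / t_base_s,
        "power_ratio": p_new / p_base,
    }


if __name__ == "__main__":
    # Made-up readings for illustration only, not measured data.
    print(power_from_energy([0.0, 0.5, 1.0, 1.5], [0.0, 50.0, 150.0, 250.0]))
    print(split_energy_change(100.0, 1000.0, 40.0, 500.0))
```

With the illustrative inputs in the `__main__` block, the second call describes a run whose energy halves even though its mean power rises by 25%, because runtime falls to 40% of the baseline. Differencing at the counter's own update interval avoids any smoothing applied to a reported power value, at the cost of amplifying timestamp jitter when the interval is short.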