More than half of the Top 500 supercomputers employ GPUs as accelerators. On GPU-accelerated platforms, developers face a key diagnostic gap: profilers show source lines where stalls occur, but not why they occur. Furthermore, the same kernel may have different stalls and underlying causes on different GPUs. This paper presents LEO, a root-cause analyzer for NVIDIA, AMD, and Intel GPUs that performs backward slicing from stalled instructions, considering dependencies arising from registers as well as vendor-specific synchronization mechanisms. LEO attributes GPU stalls to source instructions with the goal of explaining root causes of these inefficiencies. Across 21 workloads on three GPU platforms, LEO-guided optimizations deliver geometric-mean speedups of 1.73$\times$--1.82$\times$. Our case studies show that (1) the same kernel may require different optimizations for different GPU architectures, and (2) LEO's structured diagnostics improve code optimization with large language models relative to code-only and raw-stall-count baselines.
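To illustrate what backward slicing from a stalled instruction means, here is a minimal sketch that follows register dependencies only, over straight-line code. It is not LEO's implementation: the `Instr` type, the `backward_slice` function, and the toy program are hypothetical names introduced for this example, and it ignores control flow and the vendor-specific synchronization mechanisms the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: position, opcode, registers written and read."""
    idx: int
    op: str
    defs: tuple
    uses: tuple


def backward_slice(program, stalled_idx):
    """Return the indices of earlier instructions that the stalled instruction
    transitively depends on through registers.

    Assumes straight-line code in which program[i].idx == i.
    """
    # Registers whose most recent definition we still need to find.
    needed = set(program[stalled_idx].uses)
    slice_idxs = []
    # Walk backward from the instruction just before the stall.
    for instr in reversed(program[:stalled_idx]):
        hit = needed.intersection(instr.defs)
        if hit:
            slice_idxs.append(instr.idx)
            # These registers are now explained; the producer's inputs are not.
            needed -= hit
            needed |= set(instr.uses)
    return sorted(slice_idxs)


# Toy program: the store at index 5 is taken to be the stalled instruction.
program = [
    Instr(0, "load", ("r1",), ("r0",)),
    Instr(1, "mov", ("r5",), ()),
    Instr(2, "mul", ("r2",), ("r1", "r1")),
    Instr(3, "add", ("r3",), ("r2", "r5")),
    Instr(4, "add", ("r6",), ("r5", "r5")),
    Instr(5, "store", (), ("r3",)),
]

print(backward_slice(program, 5))
```

In this toy program the slice for the store contains instructions 0 through 3 and excludes instruction 4, which writes a register the store never reads. A root-cause analyzer would then look inside such a slice for the producer, for example the long-latency load, that explains the stall.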