GPGPU-accelerated clusters and supercomputers are central to modern high-performance computing (HPC). Over the past decade, these systems have continued to expand, and GPUs now expose a wide range of hardware counters that provide detailed views of performance and resource usage. Despite the potential of these counters, few studies have evaluated the insights they offer about real workloads at scale. In this work, we address this gap by analyzing previously underexplored GPU hardware counters collected via the Lightweight Distributed Metric Service (LDMS) on Perlmutter, a leadership-class supercomputer. We quantify uneven work distribution across GPUs within a job and the steadiness of GPU activity over time, and we classify jobs as compute- or memory-bound using a roofline-based criterion. We then use these metrics to interpret job behavior in terms of practical workload characteristics, yielding interpretable, job-level insights. Our findings can inform workload optimization and future HPC system design: for example, 81% of jobs are memory-bound, and memory-bound jobs tend to consume more energy than compute-bound jobs at comparable GPU-hours. Among jobs requesting 80 GB GPUs, 55% peak at or below 50% of HBM capacity.
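To make the roofline-based criterion concrete, the minimal sketch below shows one way a job could be classified from aggregate counter data: a job whose arithmetic intensity (FLOPs per byte of HBM traffic) falls below the machine's ridge point is bandwidth-limited, hence memory-bound. The A100 peak figures are published spec-sheet values, but the FP32 choice, the counter inputs, and the `classify_job` function are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative roofline-style compute- vs. memory-bound test.
# A100 peaks are spec-sheet values; the function interface is a
# hypothetical stand-in for the paper's actual classification step.

PEAK_FLOPS = 19.5e12  # NVIDIA A100 peak FP32 throughput, FLOP/s
PEAK_BW = 2.0e12      # NVIDIA A100 (80 GB) peak HBM bandwidth, bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BW  # ~9.75 FLOP/byte, where the roofs meet


def classify_job(flops: float, dram_bytes: float) -> str:
    """Classify a job by its measured arithmetic intensity (FLOP/byte).

    Below the ridge point, attainable performance is capped by memory
    bandwidth (memory-bound); above it, by peak compute (compute-bound).
    """
    intensity = flops / dram_bytes
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"


# Example: 1e15 FLOPs against 2e15 bytes of HBM traffic gives an
# intensity of 0.5 FLOP/byte, far below ~9.75, hence memory-bound.
print(classify_job(flops=1e15, dram_bytes=2e15))
```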