Artificial intelligence (AI) workloads are driving a rapid expansion of high-performance computing (HPC) infrastructures and are pushing their power and energy demands towards a critical level. AI benchmarks that represent state-of-the-art workloads, together with an understanding of them in the context of performance-energy trade-offs, are critical for deploying efficient infrastructures and can guide energy-efficiency measures such as power capping. We introduce a benchmarking framework comprising popular deep learning applications from computer vision (image classification and generation) and large language models (continued pre-training and inference), each implemented with modern methods. Our performance analysis focuses on throughput rather than time to "completion", the standard metric in HPC. We analyse performance and energy efficiency under various power-capping scenarios on NVIDIA H100, NVIDIA H200, and AMD MI300X GPUs. Our results reveal that no universal optimal power cap exists: the efficiency peak varies across application types and GPU architectures. Interestingly, the two NVIDIA GPUs, which differ mainly in their HBM configuration, show qualitatively different performance-energy trade-offs. The benchmarking framework will be released as a public tool.