We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific applications. Our primary development objective was to maximize parallel performance and scalability in solving sparse linear systems whose dimensions far exceed the memory capacity of a single node. To this end, we devised methods that expose a high degree of parallelism while optimizing algorithmic implementations for efficient multi-GPU usage. Previous work has already demonstrated the library's performance and scalability on large-scale systems comprising thousands of NVIDIA GPUs, achieving improvements over state-of-the-art solutions. In this paper, we extend those results by providing energy profiles that address the growing sustainability requirements of modern HPC platforms. We present our methodology and tools for accurate runtime energy measurement of the library's core components and discuss the findings. Our results confirm that optimizing GPU computations and minimizing data movement across memory and computing nodes reduce both time-to-solution and energy consumption. Moreover, we show that the library delivers substantial advantages over comparable software frameworks on standard benchmarks.