Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

arXiv preprint. To appear in the Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC Workshops '25), November 16-21, 2025, St. Louis, MO, USA.

The hardware diversity in leadership-class computing facilities, alongside the immense performance boosts from today's GPUs when computing in lower precision, incentivizes scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using hipify for performance portability and apply it to FFTMatvec, an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent performance. Performance optimizations for AMD GPUs are integrated into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 4,096 GPUs on the OLCF Frontier supercomputer.
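
For readers unfamiliar with the underlying idea, the following is a minimal CPU-only sketch of how an FFT can compute a matrix-vector product with a lower-triangular Toeplitz matrix. It is an illustration under simplifying assumptions, not FFTMatvec's code: it handles the scalar case rather than the block case, runs in double precision only, and uses a small pure-Python radix-2 FFT in place of a GPU FFT library. The function names `fft`, `toeplitz_lower_matvec_fft`, and `toeplitz_lower_matvec_direct` are hypothetical and chosen for this example.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def toeplitz_lower_matvec_fft(col, v):
    """Return y = T v, where T[i][j] = col[i - j] for i >= j and 0 otherwise."""
    n = len(col)
    # Pad to a power of two >= 2n so circular convolution equals linear convolution.
    m = 1
    while m < 2 * n:
        m *= 2
    c = [complex(a) for a in col] + [0j] * (m - n)
    x = [complex(a) for a in v] + [0j] * (m - n)
    # Convolution theorem: elementwise product in Fourier space.
    y_hat = [a * b for a, b in zip(fft(c), fft(x))]
    y = fft(y_hat, inverse=True)
    # Normalize the inverse transform and keep the first n entries.
    return [y[i].real / m for i in range(n)]


def toeplitz_lower_matvec_direct(col, v):
    """Reference O(n^2) product used to check the FFT-based result."""
    n = len(col)
    return [sum(col[i - j] * v[j] for j in range(i + 1)) for i in range(n)]


if __name__ == "__main__":
    col = [1.0, 2.0, 3.0]
    v = [1.0, 1.0, 1.0]
    print(toeplitz_lower_matvec_fft(col, v))
    print(toeplitz_lower_matvec_direct(col, v))
```

The FFT route costs O(n log n) operations instead of O(n^2). Its forward transform, elementwise product, and inverse transform are separate phases, which is the kind of structure that makes per-phase precision choices possible in a mixed-precision scheme.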
