Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs

The reduction of a banded matrix to bidiagonal form is a critical step in computing singular values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable for GPUs because it is memory-bound. However, recent advances in GPU architectures, such as increased L1 memory per Streaming Multiprocessor or Compute Unit and larger L2 caches, have shifted this paradigm. In this work, we present the first GPU-accelerated algorithm for reducing a banded matrix to bidiagonal form, integrated into the open-source software package NextLA.jl. Our algorithm builds on prior cache-efficient bulge-chasing methods for multicore CPUs, adapted to modern GPU architectures to optimize throughput. Leveraging Julia's high-level array abstractions and KernelAbstractions, we implement a single function that is both hardware-agnostic and data-precision-aware, running efficiently across NVIDIA, AMD, Intel, and Apple Metal GPUs. We develop a hardware-aware performance model to guide tuning and identify the key hyperparameters that govern optimal GPU performance for memory-bound workloads. We show that such workloads, when carefully optimized, can achieve substantial speed-ups on modern GPUs: our implementation outperforms the multithreaded CPU libraries PLASMA and SLATE starting from matrix sizes as small as 1024 × 1024, and achieves over 100x speed-up on 32k × 32k matrices. Moreover, the algorithm's performance scales linearly with the matrix bandwidth, enabling efficient reduction of matrices with larger bandwidths, which was previously considered impractical.
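For readers unfamiliar with bulge chasing, the following is a minimal sequential sketch of the idea in pure Python. It is not the NextLA.jl GPU kernel and makes several simplifying assumptions: the matrix is square, upper triangular and banded, it is held in dense storage, and Givens rotations are used one outer diagonal at a time. All function names (`givens`, `rotate_columns`, `rotate_rows`, `band_to_bidiagonal`, `make_banded`, `frobenius`) are hypothetical and chosen for illustration only.

```python
import math


def givens(a, b):
    """Return (c, s) with c*c + s*s == 1 such that -s*a + c*b == 0."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def rotate_columns(A, p, q, c, s):
    """Apply a Givens rotation from the right to columns p and q, in place."""
    for row in A:
        x, y = row[p], row[q]
        row[p] = c * x + s * y
        row[q] = -s * x + c * y


def rotate_rows(A, p, q, c, s):
    """Apply a Givens rotation from the left to rows p and q, in place."""
    for k in range(len(A[p])):
        x, y = A[p][k], A[q][k]
        A[p][k] = c * x + s * y
        A[q][k] = -s * x + c * y


def band_to_bidiagonal(A, bandwidth):
    """Reduce an upper-triangular banded matrix to upper bidiagonal form.

    Assumption: A[i][j] is nonzero only for i <= j <= i + bandwidth.
    The outermost superdiagonal is removed first, then the next, and so on.
    """
    n = len(A)
    for d in range(bandwidth, 1, -1):      # current upper bandwidth
        for i in range(n - d):             # row whose outermost entry is removed
            row, col = i, i + d
            while col < n:
                # Zero A[row][col] by mixing columns col-1 and col.
                # This creates a bulge just below the diagonal at A[col][col-1].
                c, s = givens(A[row][col - 1], A[row][col])
                rotate_columns(A, col - 1, col, c, s)
                A[row][col] = 0.0
                # Zero that bulge by mixing rows col-1 and col.
                # This pushes a new bulge to A[col-1][col+d], if it is in range.
                c, s = givens(A[col - 1][col - 1], A[col][col - 1])
                rotate_rows(A, col - 1, col, c, s)
                A[col][col - 1] = 0.0
                # Chase the new bulge further down the band.
                row, col = col - 1, col + d
    return A


def make_banded(n, bandwidth):
    """Build a well-conditioned upper-triangular banded test matrix."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 10.0 + i
        for j in range(i + 1, min(i + bandwidth, n - 1) + 1):
            A[i][j] = 1.0 / (j - i)
    return A


def frobenius(A):
    """Frobenius norm, which orthogonal rotations leave unchanged."""
    return math.sqrt(sum(x * x for row in A for x in row))


if __name__ == "__main__":
    A = make_banded(8, 3)
    B = band_to_bidiagonal([row[:] for row in A], 3)
    print("diagonal:     ", [round(B[i][i], 4) for i in range(8)])
    print("superdiagonal:", [round(B[i][i + 1], 4) for i in range(7)])
```

Each rotation touches only two adjacent rows or columns, so chases started from different rows can overlap as long as they stay far enough apart in the band. That locality is the source of the parallelism mentioned above. The dense, full-row updates in this sketch are for brevity only; a practical implementation would restrict each update to the band.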
