The reduction of a banded matrix to bidiagonal form is a critical step in computing singular values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable for GPUs because it is memory-bound. However, recent advances in GPU architectures, such as increased L1 memory per streaming multiprocessor or compute unit and larger L2 caches, have changed this picture. In this work, we present the first GPU-accelerated algorithm for reducing a banded matrix to bidiagonal form, integrated into an open-source software package. Our algorithm builds on prior cache-efficient bulge-chasing methods for multicore CPUs, adapted to modern GPU architectures to optimize throughput. Leveraging Julia's high-level array abstractions and KernelAbstractions.jl, we implement a single function that is both hardware-agnostic and data-precision-aware, running efficiently across NVIDIA, AMD, Intel, and Apple Metal GPUs. We develop a hardware-aware performance model to guide tuning and identify the key hyperparameters that govern GPU performance for memory-bound workloads. We show that such workloads, when carefully optimized, can achieve substantial speed-ups on modern GPUs: our implementation outperforms multithreaded CPU libraries (PLASMA, SLATE) starting from matrix sizes as small as 1024 x 1024, and achieves over 100x speed-up on 32k x 32k matrices. Moreover, the algorithm's performance scales linearly with the matrix bandwidth, enabling efficient reduction of matrices with larger bandwidths, which was previously considered impractical.
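To make the band-to-bidiagonal step concrete, the following is a minimal sequential sketch of bulge chasing in plain Python. It is not the GPU kernel described above: it assumes a square, upper-triangular input with upper bandwidth `b`, stores the matrix densely as a list of lists, and uses Givens rotations rather than whatever blocked transformations the package applies. The names `givens`, `rotate_cols`, `rotate_rows`, and `band_to_bidiagonal` are illustrative and do not come from the package.

```python
import math


def givens(a, b):
    """Return (c, s) with c*a + s*b = hypot(a, b) and -s*a + c*b = 0."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to eliminate: identity rotation
    return a / r, b / r


def rotate_cols(A, j1, j2, c, s):
    """Apply a plane rotation to columns j1 and j2 (orthogonal, from the right)."""
    for row in A:
        x, y = row[j1], row[j2]
        row[j1] = c * x + s * y
        row[j2] = -s * x + c * y


def rotate_rows(A, i1, i2, c, s):
    """Apply a plane rotation to rows i1 and i2 (orthogonal, from the left)."""
    r1, r2 = A[i1], A[i2]
    for k in range(len(r1)):
        x, y = r1[k], r2[k]
        r1[k] = c * x + s * y
        r2[k] = -s * x + c * y


def band_to_bidiagonal(A, b):
    """Reduce an upper-triangular matrix with upper bandwidth b to upper
    bidiagonal form. Assumes A[i][j] == 0 unless 0 <= j - i <= b.
    Returns a new matrix; the input is left unchanged."""
    n = len(A)
    B = [row[:] for row in A]
    for i in range(n - 2):
        # Eliminate row i from the outermost band entry inward.
        for j in range(min(i + b, n - 1), i + 1, -1):
            row, col = i, j
            while col < n:
                # Column rotation zeroes B[row][col] but creates a
                # bulge just below the diagonal at (col, col - 1).
                c, s = givens(B[row][col - 1], B[row][col])
                rotate_cols(B, col - 1, col, c, s)
                B[row][col] = 0.0
                # Row rotation removes that bulge and pushes fill-in
                # to (col - 1, col + b), one step outside the band.
                c, s = givens(B[col - 1][col - 1], B[col][col - 1])
                rotate_rows(B, col - 1, col, c, s)
                B[col][col - 1] = 0.0
                # Chase the new fill-in until it leaves the matrix.
                row, col = col - 1, col + b
    return B
```

Because only orthogonal rotations are applied from the left and right, the output has the same singular values as the input. Each eliminated entry triggers a chase that moves down the band in steps of `b`, and chases started from different rows touch disjoint parts of the matrix once they are far enough apart, which is the source of the parallelism mentioned above.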