We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, so that most computations are completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs that still support 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic classes of algorithms: methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), which is based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straightforward implementation of the Gramian-based methods with a more sophisticated TSQR implementation in terms of performance achieved, time-to-solution, and implementation complexity. Using performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime. We also show that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
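To make the Gram-matrix approach concrete, the following is a minimal NumPy sketch of Cholesky-QR2, assuming a full-rank, reasonably well-conditioned input. It runs on the CPU and only illustrates the arithmetic; it is not the GPU implementation discussed in the paper, and the function names `cholesky_qr` and `cholesky_qr2` are hypothetical.

```python
import numpy as np


def cholesky_qr(A):
    """One Cholesky-QR pass: A = Q R, with R taken from the Gram matrix."""
    G = A.T @ A                        # small n x n Gram matrix (normal equations)
    L = np.linalg.cholesky(G)          # G = L L^T, L lower triangular
    R = L.T                            # upper-triangular factor
    # Q = A R^{-1}, computed as the solution of L Q^T = A^T.
    # (A general solver is used for brevity; a triangular solve would suffice.)
    Q = np.linalg.solve(L, A.T).T
    return Q, R


def cholesky_qr2(A):
    """Cholesky-QR2: repeat the pass on Q to improve its orthogonality."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                  # product of upper-triangular factors
```

Each pass reads the tall matrix once to form the Gram matrix and once more to apply the inverse triangular factor, which is why such methods are limited by memory bandwidth when the number of columns is small. If only R is needed, the final write of Q can in principle be omitted, which is the idea behind the "Q-less" variant named above.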