CholeskyQR2 and shifted CholeskyQR3 are two state-of-the-art algorithms for computing tall-and-skinny QR factorizations, since they attain high performance on current computer architectures. However, to guarantee stability, CholeskyQR2 imposes a restriction on the condition number of the matrix to be factorized that is prohibitive for some applications. Shifted CholeskyQR3 is stable but incurs $50\%$ more computational and communication cost than CholeskyQR2. In this paper, a randomized QR algorithm called Randomized Householder-Cholesky (\texttt{rand\_cholQR}) is proposed and analyzed. It is proved that, when one or two random sketch matrices are used, the orthogonality error of \texttt{rand\_cholQR} is, with high probability, bounded by a constant of the order of unit roundoff for any numerically full-rank matrix, and hence the algorithm is as stable as shifted CholeskyQR3. An evaluation of the performance of \texttt{rand\_cholQR} on an NVIDIA A100 GPU demonstrates that for tall-and-skinny matrices, \texttt{rand\_cholQR} with multiple sketch matrices is nearly as fast as, or in some cases faster than, CholeskyQR2. Hence, compared to CholeskyQR2, \texttt{rand\_cholQR} is more stable with almost no extra computational or memory cost, and is therefore a superior algorithm both in theory and in practice.
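As a rough illustration of the two approaches compared above, the following NumPy sketch implements CholeskyQR2 next to one possible reading of a sketch-preconditioned Cholesky QR. The abstract does not spell out the steps of \texttt{rand\_cholQR}, so the structure used here (sketch the matrix, take a Householder QR of the small sketch, use its triangular factor as a preconditioner, then apply a single Cholesky QR pass) is an assumption. The Gaussian sketch, the sketch size `4 * n`, the use of a single sketch matrix, and all function names are likewise hypothetical choices made for illustration only.

```python
import numpy as np


def chol_qr(A):
    """One Cholesky QR pass: R is the Cholesky factor of A^T A, Q = A R^{-1}."""
    G = A.T @ A                      # n-by-n Gram matrix
    R = np.linalg.cholesky(G).T      # upper triangular, G = R^T R
    # Solve R^T X = A^T, so that X^T = A R^{-1}. A dedicated triangular
    # solver would be used in practice; a general solve keeps this NumPy-only.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R


def chol_qr2(A):
    """CholeskyQR2: two Cholesky QR passes, with the R factors multiplied."""
    Q1, R1 = chol_qr(A)
    Q, R2 = chol_qr(Q1)
    return Q, R2 @ R1


def rand_chol_qr_sketch(A, rng, oversample=4):
    """Hypothetical sketch-preconditioned Cholesky QR (assumed structure)."""
    m, n = A.shape
    k = oversample * n                               # assumed sketch size
    S = rng.standard_normal((k, m)) / np.sqrt(k)     # assumed Gaussian sketch
    # R factor of the small k-by-n sketch; NumPy's qr calls LAPACK's
    # Householder-based routines.
    R0 = np.linalg.qr(S @ A, mode="r")
    Q0 = np.linalg.solve(R0.T, A.T).T                # precondition: Q0 = A R0^{-1}
    Q, R1 = chol_qr(Q0)                              # one Cholesky QR pass
    return Q, R1 @ R0


def ill_conditioned(m, n, cond, rng):
    """Test matrix with singular values spaced logarithmically from 1 to 1/cond."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (U * np.logspace(0, -np.log10(cond), n)) @ V.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = ill_conditioned(2000, 20, 1e10, rng)
    I = np.eye(A.shape[1])
    Q, R = rand_chol_qr_sketch(A, rng)
    print("sketched variant, ||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - I))
    try:
        Q2, _ = chol_qr2(A)
        print("CholeskyQR2,      ||Q^T Q - I|| =", np.linalg.norm(Q2.T @ Q2 - I))
    except np.linalg.LinAlgError:
        # The Gram matrix may be numerically indefinite at this condition number.
        print("CholeskyQR2: Cholesky factorization of the Gram matrix failed")
```

The test matrix has condition number $10^{10}$, so its Gram matrix is too ill-conditioned for a double-precision Cholesky factorization to be reliable, which is the regime where the restriction on CholeskyQR2 applies. In the sketched variant the Cholesky factorization is applied only to the preconditioned matrix, which under these assumptions is expected to be well conditioned.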