We present a novel algorithm for computing the QR factorisation of extremely ill-conditioned tall-and-skinny matrices on distributed-memory systems. The algorithm builds on the communication-avoiding CholeskyQR2 algorithm and its block Gram-Schmidt variant. The latter improves the numerical stability of CholeskyQR2 and significantly reduces the loss of orthogonality, even for matrices with condition numbers up to $10^{15}$. To date, no distributed GPU version of this variant has been described in the literature, which prevents its application to very large matrices. We provide a distributed implementation of the variant and introduce a modified version that improves performance, especially for extremely ill-conditioned matrices. The main innovation of our approach is the interleaving of the CholeskyQR steps with the Gram-Schmidt orthogonalisation, which ensures that update steps are performed with fully orthogonalised panels. The orthogonality and numerical stability achieved by our modified algorithm are equivalent to those of CholeskyQR2 with Gram-Schmidt and other state-of-the-art methods. Weak-scaling tests on our test matrices show significant performance improvements: our algorithm outperforms the state-of-the-art Householder-based QR factorisation algorithms available in ScaLAPACK by a factor of $6$ on CPU-only systems and by a factor of up to $80$ on GPU-based systems with distributed memory.
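As a rough illustration of the building blocks named above, the following is a minimal single-process NumPy sketch of textbook CholeskyQR2 and of a generic panel-wise block Gram-Schmidt wrapper around it. It is not the distributed GPU implementation, and it does not reproduce the modified interleaved algorithm introduced in the paper. The function names `cholesky_qr`, `cholesky_qr2` and `cholesky_qr2_bgs`, the `panel` parameter, and the choice of two projection passes are assumptions made for this example only.

```python
import numpy as np


def cholesky_qr(A):
    """One CholeskyQR pass: A = Q R with R from the Cholesky factor of A^T A."""
    # Gram matrix. In a distributed setting this would typically be a local
    # product followed by a global reduction (assumption, not shown here).
    G = A.T @ A
    L = np.linalg.cholesky(G)          # G = L L^T, L lower triangular
    R = L.T                            # upper triangular factor
    Q = np.linalg.solve(L, A.T).T      # Q = A R^{-1}, computed as (L^{-1} A^T)^T
    return Q, R


def cholesky_qr2(A):
    """CholeskyQR2: repeat CholeskyQR once to restore orthogonality."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1


def cholesky_qr2_bgs(A, panel):
    """Illustrative panel-wise variant: block Gram-Schmidt plus CholeskyQR2.

    Each panel is first projected against all previously orthogonalised
    panels (two passes, an assumption of this sketch) and then factorised
    with CholeskyQR2.
    """
    m, n = A.shape
    Q = np.array(A, dtype=float)       # working copy, overwritten panel by panel
    R = np.zeros((n, n))
    for j in range(0, n, panel):
        cur = slice(j, min(j + panel, n))
        if j > 0:
            prev = slice(0, j)
            for _ in range(2):
                S = Q[:, prev].T @ Q[:, cur]   # projection coefficients
                Q[:, cur] -= Q[:, prev] @ S    # remove components along Q_prev
                R[prev, cur] += S              # accumulate into off-diagonal block
        Q[:, cur], R[cur, cur] = cholesky_qr2(Q[:, cur])
    return Q, R
```

Calling `cholesky_qr2_bgs(A, panel)` on a tall matrix `A` with full column rank returns `Q` and an upper triangular `R` with `A ≈ Q @ R`. Plain `cholesky_qr2` can fail in the Cholesky step once the condition number of `A` approaches the inverse square root of machine precision. Working panel by panel means only the narrower, already projected panels have to pass through the Cholesky step.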