With the recent emergence of mixed precision hardware, there has been renewed interest in using it to solve numerical linear algebra problems quickly and accurately. Total least squares problems, i.e., solving $\min_{E,r} \| [E, r]\|_F$ subject to $(A+E)x=b+r$, arise in numerous applications. Solving this problem requires finding the smallest singular value and corresponding right singular vector of $[A,b]$, which is challenging when $A$ is large and sparse. An efficient algorithm for this case due to Bj\"{o}rck et al. [SIAM J. Matrix Anal. Appl. 22(2), 2000], called RQI-PCGTLS, is based on Rayleigh quotient iteration coupled with the preconditioned conjugate gradient method. We develop a mixed precision variant of this algorithm, RQI-PCGTLS-MP, in which up to three different precisions can be used. We assume that the lowest precision is used in the computation of the preconditioner, and give theoretical constraints on how this precision must be chosen to ensure stability. In contrast to standard least squares, the resulting constraint for total least squares depends not only on the matrix $A$ but also on the right-hand side $b$. We perform a number of numerical experiments on model total least squares problems used in the literature, which demonstrate that our algorithm can attain the same accuracy as RQI-PCGTLS, albeit with a potential delay in convergence due to the use of low precision. Performance modeling shows that the mixed precision approach can achieve up to a $4\times$ speedup, depending on the size of the matrix and the number of Rayleigh quotient iterations performed.
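To make the problem statement concrete, the following is a minimal dense reference sketch of the connection between the total least squares solution and the smallest singular triplet of $[A,b]$. It is not RQI-PCGTLS or RQI-PCGTLS-MP: it uses a full SVD in double precision, which is only practical for small dense problems, and the function name `tls_via_svd` and the test data are illustrative assumptions. It assumes $m \ge n+1$ and a generic problem, meaning the last component of the right singular vector is nonzero.

```python
import numpy as np


def tls_via_svd(A, b):
    """Dense reference TLS solve via the SVD of [A, b] (illustrative helper).

    Returns x, the smallest singular value sigma of [A, b], and the
    minimal-norm correction (E, r) such that (A + E) x = b + r.
    """
    m, n = A.shape
    C = np.column_stack([A, b])  # the augmented matrix [A, b]

    # Singular values come back in descending order, so the last row of Vt
    # is the right singular vector belonging to the smallest singular value.
    _, s, Vt = np.linalg.svd(C, full_matrices=False)
    sigma = s[-1]
    v = Vt[-1]

    if abs(v[-1]) < np.finfo(float).eps:
        raise ValueError("nongeneric TLS problem: last component of v is ~0")

    # Scale v so that its last component is -1; the leading part is x.
    x = -v[:n] / v[-1]

    # Rank-one correction [E, r] = -C v v^T, which has Frobenius norm sigma.
    correction = -np.outer(C @ v, v)
    E, r = correction[:, :n], correction[:, n]
    return x, sigma, E, r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 5))
    b = A @ np.ones(5) + 1e-2 * rng.standard_normal(50)
    x, sigma, E, r = tls_via_svd(A, b)
    print("TLS solution:", x)
    print("smallest singular value of [A, b]:", sigma)
```

The computed $x$ also satisfies $(A^T A - \sigma^2 I)x = A^T b$, where $\sigma$ is the smallest singular value of $[A,b]$. A Rayleigh quotient iteration with a preconditioned conjugate gradient inner solver, as the abstract describes, is one way to reach the same solution without forming a dense SVD when $A$ is large and sparse.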