The eigenvalue decomposition (EVD) of (a batch of) Hermitian matrices of order two plays a role in many numerical algorithms, of which the one-sided Jacobi method for the singular value decomposition (SVD) is the prime example. In this paper the batched EVD is vectorized, with a vector-friendly data layout and the AVX-512 SIMD instructions of Intel CPUs, alongside other key components of a real and a complex OpenMP-parallel Jacobi-type SVD method, inspired by the sequential xGESVJ routines from LAPACK. These vectorized building blocks should be portable to other platforms that support similar vector operations. Unconditional numerical reproducibility is guaranteed for the batched EVD, sequential or threaded, and for the column transformations, which, like the scaled dot-products, are presently sequential but can be threaded if nested parallelism is desired. No avoidable overflow of the results can occur with the proposed EVD or the whole SVD. The measured accuracy of the proposed EVD often surpasses that of the xLAEV2 routines from LAPACK. While the batched EVD outperforms the matching sequence of xLAEV2 calls, the speedup of the parallel SVD is modest; it can be improved, and is already beneficial with enough threads. Regardless of the number of threads, the proposed SVD method gives identical results, but of somewhat lower accuracy than xGESVJ.
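To make the central operation concrete, the following is a minimal scalar sketch of a Jacobi-style EVD of one real symmetric matrix of order two, applied element by element to a batch. It is not the method proposed in the paper: it omits the prescaling that prevents avoidable overflow, the complex Hermitian case, the vector-friendly data layout, and the AVX-512 vectorization. The names `evd2_sym` and `batched_evd2_sym` are hypothetical and chosen only for illustration.

```python
import math


def evd2_sym(a, b, c):
    """Textbook Jacobi EVD of the real symmetric matrix A = [[a, b], [b, c]].

    Returns (lam1, lam2, cs, sn) such that, with U = [[cs, sn], [-sn, cs]],
    U^T A U = diag(lam1, lam2). Illustrative assumption: inputs are finite
    and moderately scaled, since no overflow-avoiding prescaling is done here.
    """
    if b == 0.0:
        # Already diagonal: the identity rotation suffices.
        return a, c, 1.0, 0.0
    # t = tan(phi) is the smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0.
    zeta = (c - a) / (2.0 * b)
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.hypot(1.0, zeta))
    cs = 1.0 / math.hypot(1.0, t)  # cos(phi)
    sn = t * cs                    # sin(phi)
    return a - t * b, c + t * b, cs, sn


def batched_evd2_sym(batch):
    """Apply evd2_sym independently to each (a, b, c) triple in the batch.

    Each matrix is independent of the others, which is what makes a batch
    amenable to processing one matrix per SIMD lane.
    """
    return [evd2_sym(a, b, c) for (a, b, c) in batch]
```

For example, `evd2_sym(2.0, 1.0, 2.0)` yields eigenvalues close to 1 and 3 with `cs` and `sn` both close to `1/sqrt(2)`. Because every matrix is decomposed by the same sequence of operations regardless of how the batch is partitioned, a scheme of this shape gives the same results whether the batch is processed sequentially or split among threads.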