The factorization of skew-symmetric matrices is a critically understudied area of dense linear algebra (DLA), particularly in comparison to the symmetric case. While some symmetric algorithms can be adapted, their cost can be reduced further by exploiting skew-symmetry. A motivating example is the factorization $X=LTL^T$ of a skew-symmetric matrix $X$, which is used in practice to obtain the determinant of $X$ as the square of the cheaply computed Pfaffian of the skew-symmetric tridiagonal matrix $T$, for example in fields such as quantum electronic structure and machine learning. Such applications also often require pivoting to improve numerical stability. In this work we explore a combination of known algorithms from the literature and new algorithms recently derived using formal methods. We create high-performance parallel CPU implementations that leverage the concept of fusion at multiple levels to reduce memory-traffic overhead, as well as the BLIS framework, which provides high-performance GEMM kernels, hierarchical parallelism, and cache blocking. We find that operation fusion and improved use of available bandwidth via parallelization of bandwidth-bound (level-2 BLAS) operations are essential for obtaining high performance, while a concise C++ implementation provides a clear and close connection to the formal derivation process without sacrificing performance.
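To illustrate why the $X=LTL^T$ factorization makes the determinant cheap: for a skew-symmetric tridiagonal $T$ of even dimension, the Pfaffian reduces to a product of every other superdiagonal entry, and $\det(X)=\det(T)=\mathrm{Pf}(T)^2$ when $L$ is unit lower triangular. A minimal sketch is below; the function name and the convention of passing only the superdiagonal are illustrative, not taken from the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

// Pfaffian of an n x n skew-symmetric tridiagonal matrix T, given its
// superdiagonal entries t[i] = T(i, i+1) for i = 0..n-2 (0-based).
// For even n, Pf(T) = t[0] * t[2] * ... * t[n-2]; for odd n it is zero
// (an odd-dimensional skew-symmetric matrix is singular).
// With X = L T L^T and unit lower triangular L, det(X) = Pf(T)^2.
double pfaffian_tridiag(const std::vector<double>& t) {
    std::size_t n = t.size() + 1;   // matrix dimension
    if (n % 2 != 0) return 0.0;     // odd dimension: Pfaffian vanishes
    double pf = 1.0;
    for (std::size_t i = 0; i + 1 < n; i += 2)
        pf *= t[i];                 // only every other superdiagonal entry
    return pf;
}
```

For example, the $4\times 4$ tridiagonal $T$ with superdiagonal $(2, 5, 3)$ has $\mathrm{Pf}(T) = 2\cdot 3 = 6$ and hence $\det(T) = 36$, an $O(n)$ computation versus a full determinant.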