Structured dense matrices arise from boundary integral problems in electrostatics and geostatistics, as well as from Schur complements in sparse preconditioners such as multifrontal methods. Exploiting the structure of such matrices can reduce the time for dense direct factorization from $O(N^3)$ to $O(N)$. The Hierarchically Semi-Separable (HSS) matrix is one such low-rank matrix format, and it can be factorized using a Cholesky-like algorithm called ULV factorization. The HSS-ULV algorithm is highly parallel because it removes the dependency on trailing sub-matrices at each HSS level. However, a key merge step that links two successive HSS levels remains a challenge for efficient parallelization. In this paper, we combine the asynchronous runtime system PaRSEC with the HSS-ULV algorithm. We compare our work with STRUMPACK and LORAPO, both state-of-the-art implementations of dense direct low-rank factorization, and achieve up to 2x faster factorization for matrices arising from a diverse set of applications on up to 128 nodes of Fugaku, with similar or better accuracy on all the problems that we survey.
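The low-rank structure that formats such as HSS exploit can be seen in a small experiment. The sketch below is an illustration only and is not the HSS-ULV algorithm or anything taken from the paper: the kernel $1/(|x-y| + \delta)$, the uniformly spaced 1D points, the matrix size, and the tolerance are all assumptions chosen for the demonstration. It compares the numerical rank of an off-diagonal block of a dense kernel matrix with that of a diagonal block of the same size.

```python
import numpy as np


def kernel_matrix(points, shift=1e-3):
    """Dense matrix with entries 1 / (|x_i - x_j| + shift).

    The kernel and the shift are illustrative assumptions, standing in for
    the kind of interaction kernel found in boundary integral problems.
    """
    dist = np.abs(points[:, None] - points[None, :])
    return 1.0 / (dist + shift)


def numerical_rank(block, rel_tol=1e-8):
    """Count singular values larger than rel_tol times the largest one."""
    singular_values = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


n = 512
half = n // 2
points = np.linspace(0.0, 1.0, n)
A = kernel_matrix(points)

# Diagonal block: interactions among the first half of the points.
rank_diag = numerical_rank(A[:half, :half])
# Off-diagonal block: interactions between the two halves of the points.
rank_off = numerical_rank(A[:half, half:])

print(f"block size: {half}")
print(f"numerical rank of diagonal block:     {rank_diag}")
print(f"numerical rank of off-diagonal block: {rank_off}")
```

Under these assumptions the off-diagonal block has a numerical rank far below its dimension, so it can be stored and applied through thin factors. Formats such as HSS apply this compression recursively across levels, which is what makes sub-cubic factorization possible.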