Hierarchical Precision and Recursion for Accelerating Symmetric Linear Solves on MXUs

Symmetric linear solves are fundamental to a wide range of scientific and engineering applications, from climate modeling and structural analysis to machine learning and optimization. These workloads often rely on Cholesky decomposition (POTRF) and its supporting operations, triangular solves (TRSM) and symmetric rank-k updates (SYRK), which together form the computational core for solving symmetric positive-definite systems. To accelerate these kernels, we present a portable, mixed-precision solver designed for Matrix Processing Units (MXUs), including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X). Our algorithm builds on a nested recursive formulation in which Cholesky exposes parallelism through recursive decomposition of its TRSM and SYRK sub-problems. This structure yields a hierarchical recursion that maximizes GEMM throughput while enabling fine-grained control over numerical precision. We introduce a custom recursive data structure that assigns low-precision FP16 arithmetic to large off-diagonal blocks while preserving high precision on diagonal blocks to ensure numerical stability. The solver is implemented in Julia, leveraging array programming, multiple dispatch, and dynamic type inference to express mixed-precision computation seamlessly. This design provides a high-level, hardware-agnostic interface while calling efficiently into low-level vendor libraries for backend portability. On H200, our recursive FP64 SYRK achieves a 14x speedup over cuBLAS, while mixed precision delivers speedups of up to 27x in SYRK and 5x in TRSM over full-precision baselines. The result is a 5x overall speedup for Cholesky over cuSOLVER FP64; the mixed-precision solver is 100x more accurate than pure FP16 while retaining 88% of the pure-FP16 peak speedup. Comparable performance and accuracy trends are observed on MI300X, demonstrating broad applicability across GPUs.
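The nested recursion described above can be illustrated with a minimal sketch in plain Python. This is not the authors' Julia implementation: the function names (`cholesky`, `trsm`, `syrk`, `gemm_nt`, `assemble`, `recon_error`), the `cut` threshold, and the precision policy are all hypothetical choices made for illustration. FP16 is emulated by rounding GEMM inputs through the half-precision format of the standard `struct` module while accumulating in FP64, which is an assumption and may differ from how the MXU kernels behave.

```python
import math
import struct

def to_fp16(x):
    """Round a float to the nearest IEEE half-precision value (|x| must fit in FP16)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def gemm_nt(A, B, low):
    """Return A @ B^T. If low, round the inputs to FP16 (assumed policy); sums stay FP64."""
    if low:
        A = [[to_fp16(v) for v in r] for r in A]
        B = [[to_fp16(v) for v in r] for r in B]
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def sub(A, B):
    """Elementwise A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def assemble(M11, M21, M22):
    """Stack blocks into a block lower-triangular matrix (upper-right block zeroed)."""
    pad = [0.0] * len(M22)
    return [r + pad for r in M11] + [r1 + r2 for r1, r2 in zip(M21, M22)]

def trsm(L, B, cut):
    """Solve X @ L^T = B for X by recursing on the lower-triangular L."""
    n = len(L)
    if n == 1:
        return [[row[0] / L[0][0]] for row in B]
    k = n // 2
    L11 = [r[:k] for r in L[:k]]
    L21 = [r[:k] for r in L[k:]]
    L22 = [r[k:] for r in L[k:]]
    X1 = trsm(L11, [r[:k] for r in B], cut)
    # Off-diagonal GEMM update: low precision once the split size reaches `cut`.
    B2 = sub([r[k:] for r in B], gemm_nt(X1, L21, low=k >= cut))
    X2 = trsm(L22, B2, cut)
    return [x1 + x2 for x1, x2 in zip(X1, X2)]

def syrk(C, A, cut):
    """Return the lower triangle of C - A @ A^T by recursing on row blocks."""
    n = len(C)
    if n == 1:
        return [[C[0][0] - sum(a * a for a in A[0])]]
    k = n // 2
    A1, A2 = A[:k], A[k:]
    C11 = syrk([r[:k] for r in C[:k]], A1, cut)   # diagonal block: recurse
    C21 = sub([r[:k] for r in C[k:]], gemm_nt(A2, A1, low=k >= cut))  # off-diagonal GEMM
    C22 = syrk([r[k:] for r in C[k:]], A2, cut)   # diagonal block: recurse
    return assemble(C11, C21, C22)

def cholesky(A, cut=float("inf")):
    """Return lower-triangular L with L @ L^T ~= A; only A's lower triangle is read."""
    n = len(A)
    if n == 1:
        return [[math.sqrt(A[0][0])]]
    k = n // 2
    L11 = cholesky([r[:k] for r in A[:k]], cut)   # POTRF on the leading block
    L21 = trsm(L11, [r[:k] for r in A[k:]], cut)  # TRSM on the panel
    S = syrk([r[k:] for r in A[k:]], L21, cut)    # SYRK on the trailing block
    L22 = cholesky(S, cut)                        # POTRF on the Schur complement
    return assemble(L11, L21, L22)

def recon_error(L, A):
    """Largest absolute entry of L @ L^T - A."""
    LLt = gemm_nt(L, L, low=False)
    return max(abs(x - y) for rx, ry in zip(LLt, A) for x, y in zip(rx, ry))

if __name__ == "__main__":
    n = 8
    # Small symmetric, strictly diagonally dominant (hence positive-definite) test matrix.
    A = [[1.0 / (1 + abs(i - j)) + (n if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    print("FP64 reconstruction error:", recon_error(cholesky(A), A))
    print("mixed reconstruction error:", recon_error(cholesky(A, cut=2), A))
```

In this sketch every 1x1 diagonal base case and every accumulation stays in FP64, and only the GEMM updates on off-diagonal blocks whose split size is at least `cut` see FP16-rounded inputs. Leaving `cut` at its default of infinity gives a full-precision recursion, so the two runs in the usage example give a rough feel for the accuracy trade-off the abstract describes, without reproducing any of its reported numbers.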
