This paper explores the performance optimization of out-of-core (OOC) Cholesky factorization on shared-memory systems equipped with multiple GPUs. We employ fine-grained computational tasks to expose concurrency while creating opportunities to overlap data movement asynchronously with computation, especially when dealing with matrices that cannot fit in GPU memory. We leverage the directed acyclic graph (DAG) of the task-based Cholesky factorization and map it onto a static scheduler that promotes data reuse while supporting strategies for reducing data movement to and from the CPU host when GPU memory is exhausted. The CPU-GPU interconnect may become the main performance bottleneck as the gap between the GPU execution rate and traditional PCIe bandwidth continues to widen. While the surface-to-volume effect of compute-bound kernels partially mitigates the overhead of data motion, deploying mixed-precision (MxP) computations exacerbates the throughput discrepancy. Using static task scheduling, we evaluate the performance capabilities of the new ultra-fast NVIDIA chip interconnect technology, codenamed NVLink-C2C, which constitutes the backbone of the NVIDIA Grace Hopper Superchip (GH200), against a new four-precision (FP64/FP32/FP16/FP8) left-looking Cholesky factorization. We report the performance results of a benchmarking campaign on various NVIDIA GPU generations and interconnects. We highlight a 20% performance advantage over cuSOLVER on a single GH200 with FP64 while hiding the cost of OOC task-based Cholesky factorization, and we scale almost linearly on four GH200 superchips. With MxP enabled, our statically scheduled four-precision tile-based Cholesky factorization achieves a 3X speedup over its FP64-only counterpart, delivering application-worthy FP64 accuracy when modeling a large-scale geospatial statistical application.
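To make the algorithmic structure concrete, the following is a minimal NumPy sketch of a tile-based left-looking Cholesky factorization, the sequential pattern whose per-tile operations (SYRK, POTRF, GEMM, TRSM) become the task nodes of the DAG discussed above. This is an illustrative single-precision-path reference, not the paper's multi-GPU, mixed-precision implementation; the function name and tiling helper are ours.

```python
import numpy as np

def left_looking_tile_cholesky(A, nb):
    """In-place left-looking tiled Cholesky of a symmetric positive-definite
    matrix A, returning the lower-triangular L with A = L @ L.T.
    Each tile update maps to one BLAS/LAPACK kernel and hence one task
    in the factorization DAG."""
    n = A.shape[0]
    assert n % nb == 0, "illustrative sketch assumes n divisible by tile size"
    nt = n // nb

    def T(i, j):
        # View of tile (i, j); writes through to A.
        return A[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]

    for k in range(nt):
        # SYRK: fold all previously factored tiles on row k into the diagonal tile.
        for j in range(k):
            T(k, k)[:] -= T(k, j) @ T(k, j).T
        # POTRF: Cholesky-factor the updated diagonal tile.
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        for i in range(k + 1, nt):
            # GEMM: fold previously factored tiles into the panel tile (i, k).
            for j in range(k):
                T(i, k)[:] -= T(i, j) @ T(k, j).T
            # TRSM: solve X @ L_kk.T = T(i, k) for the panel tile.
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T

    return np.tril(A)
```

The left-looking variant touches each tile's left neighbors only when that tile is updated, which is what creates the data-reuse opportunities a static scheduler can exploit when staging tiles in and out of limited GPU memory.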