Various numerical methods used for solving partial differential equations (PDEs) result in tridiagonal systems. Solving tridiagonal systems in distributed-memory environments is not straightforward and often requires a significant amount of communication. In this article, we present a novel distributed-memory tridiagonal solver algorithm, DistD2-TDS, based on a specialised data structure. The DistD2-TDS algorithm takes advantage of the diagonal dominance of tridiagonal systems to reduce communication in distributed-memory environments. The underlying data structure plays a crucial role in the performance of the algorithm. First, the data structure improves data locality and makes it possible to minimise data movement via cache blocking and kernel fusion strategies. Second, data continuity enables a contiguous data access pattern and results in efficient utilisation of the available memory bandwidth. Finally, the data layout supports vectorisation on CPUs and thread-level parallelisation on GPUs for improved performance. To demonstrate the robustness of the algorithm, we implemented and benchmarked it on CPUs and GPUs. We investigated the single-rank performance and compared it against existing algorithms. Furthermore, we analysed the strong scaling of the implementation on up to 384 NVIDIA H100 GPUs and up to 8192 AMD EPYC 7742 CPUs. Finally, we demonstrated a practical use case of the algorithm by using compact finite difference schemes to solve a 3D non-linear PDE. The results demonstrate that the DistD2-TDS algorithm can sustain around 66% of the theoretical peak bandwidth at scale on CPU- and GPU-based supercomputers.
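To make the role of a contiguous data layout concrete, the following is a minimal sketch of a batched Thomas algorithm on a single rank. It is not the DistD2-TDS algorithm and does not reproduce its data structure or its distributed-memory communication; it only illustrates, under assumptions, how interleaving many independent tridiagonal systems lets the innermost loop walk contiguous memory, which is the kind of access pattern that vectorisation on CPUs and thread-level parallelism on GPUs rely on. The function name `thomas_batched`, the flat `i * m + k` indexing, and the test coefficients are all hypothetical choices made for this example.

```python
def thomas_batched(a, b, c, d, n, m):
    """Solve m independent tridiagonal systems, each of size n.

    Assumed layout (illustrative only): every array is a flat list of
    length n * m, and row i of system k is stored at index i * m + k.
    The m systems are therefore interleaved, and the inner loop over k
    touches consecutive entries.

    a: sub-diagonal (row 0 unused), b: main diagonal,
    c: super-diagonal (row n-1 unused), d: right-hand side.
    Returns the solution x in the same layout.
    """
    cp = [0.0] * (n * m)  # modified super-diagonal
    dp = [0.0] * (n * m)  # modified right-hand side
    x = [0.0] * (n * m)

    # Forward elimination, first row of every system.
    for k in range(m):
        cp[k] = c[k] / b[k]
        dp[k] = d[k] / b[k]

    # Forward elimination, remaining rows; inner loop is contiguous in k.
    for i in range(1, n):
        row, prev = i * m, (i - 1) * m
        for k in range(m):
            denom = b[row + k] - a[row + k] * cp[prev + k]
            cp[row + k] = c[row + k] / denom
            dp[row + k] = (d[row + k] - a[row + k] * dp[prev + k]) / denom

    # Back substitution, last row first.
    last = (n - 1) * m
    for k in range(m):
        x[last + k] = dp[last + k]
    for i in range(n - 2, -1, -1):
        row, nxt = i * m, (i + 1) * m
        for k in range(m):
            x[row + k] = dp[row + k] - cp[row + k] * x[nxt + k]
    return x


# Usage: 4 diagonally dominant systems of size 6 with a known solution.
n, m = 6, 4
a = [1.0] * (n * m)
b = [4.0] * (n * m)
c = [1.0] * (n * m)
x_true = [float(i + 10 * k) for i in range(n) for k in range(m)]

# Build the right-hand side d = A @ x_true, system by system.
d = []
for i in range(n):
    for k in range(m):
        v = b[i * m + k] * x_true[i * m + k]
        if i > 0:
            v += a[i * m + k] * x_true[(i - 1) * m + k]
        if i < n - 1:
            v += c[i * m + k] * x_true[(i + 1) * m + k]
        d.append(v)

x = thomas_batched(a, b, c, d, n, m)
max_err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
```

The sketch omits pivoting, which is safe for the diagonally dominant coefficients used here. In a compiled implementation, the inner loop over `k` has no dependency between iterations, so it is the natural candidate for SIMD lanes or GPU threads, while the loop over `i` carries the sequential recurrence.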