Sparse general matrix-matrix multiplication (SpGEMM), the product of two sparse matrices, is a key kernel in scientific computing and large-scale data analytics. It underpins graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. Unstructured sparsity makes high performance hard to achieve because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, which leads to unnecessary data movement and synchronization and leaves the fast intra-node interconnect underused. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth available between GPUs within a node than across nodes. We evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM, with a corresponding geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
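To make the "exchanging and merging partial products" step concrete, the following is a minimal single-process sketch of SpGEMM in plain Python. It is not the Trident algorithm and involves no GPUs or real communication. The dict-of-dicts storage format and the helper names `spgemm`, `split_inner`, and `merge` are assumptions made for illustration only. The two slices of the inner dimension stand in for two processes whose partial results must be summed.

```python
from collections import defaultdict


def spgemm(A, B):
    """Row-wise (Gustavson-style) sparse product C = A @ B.

    A and B map row index -> {column index: value}; only nonzeros are stored.
    """
    C = {}
    for i, a_row in A.items():
        acc = defaultdict(float)  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj  # accumulate partial products
        if acc:
            C[i] = dict(acc)
    return C


def split_inner(A, B, cut):
    """Split the inner dimension k at `cut` (hypothetical helper).

    Returns two (A_part, B_part) pairs, as if owned by two processes.
    """
    A_lo = {i: {k: v for k, v in r.items() if k < cut} for i, r in A.items()}
    A_hi = {i: {k: v for k, v in r.items() if k >= cut} for i, r in A.items()}
    B_lo = {k: r for k, r in B.items() if k < cut}
    B_hi = {k: r for k, r in B.items() if k >= cut}
    return (A_lo, B_lo), (A_hi, B_hi)


def merge(partials):
    """Sum partial result matrices entry by entry."""
    out = {}
    for P in partials:
        for i, row in P.items():
            dst = out.setdefault(i, {})
            for j, v in row.items():
                dst[j] = dst.get(j, 0.0) + v
    return out


A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}

C = spgemm(A, B)
parts = split_inner(A, B, cut=2)
merged = merge([spgemm(a, b) for a, b in parts])
print(C)            # {0: {1: 16.0}, 1: {0: 15.0}}
print(merged == C)  # True
```

In a distributed setting, each entry passed to `merge` would arrive from a different process, so the size of those partial results is one source of the communication volume the abstract refers to.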