Sparse general matrix-matrix multiplication (SpGEMM), the multiplication of two sparse matrices, is a key kernel in scientific computing and large-scale data analytics. It underpins graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. Unstructured sparsity makes high performance hard to achieve because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth; as a result, they incur unnecessary data movement and synchronization and do not fully exploit the fast intra-node interconnect. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth available between GPUs within a node than across nodes. We evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM, with a corresponding geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.