Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in high-performance computing and deep learning applications. Its major bottleneck is substantial communication overhead, which limits both performance and scalability. In this paper, we identify two key sources of communication inefficiency in distributed SpMM: redundant data transfer caused by sparsity-unaware communication, and suboptimal use of the hierarchical network topology. To address them, we propose (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that maps the sparsity-aware strategy onto two-tier GPU network architectures, minimizing redundant data movement across the slower inter-node links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that \method{} demonstrates strong scalability up to 128 GPUs, where it achieves geometric mean SpMM speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively).
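To make the two sources of inefficiency concrete, the following is a minimal sketch, not the implementation of \method{}. It assumes a 1D row partitioning of a square sparse matrix $A$ and a dense matrix $B$ across GPUs, with consecutive GPUs grouped into nodes; the abstract specifies neither this layout nor any of the names used here (`communication_volumes`, `gpus_per_node`, and so on are hypothetical). The sketch only counts how many rows of $B$ each scheme would have to move.

```python
def communication_volumes(nonzeros, n, num_gpus, gpus_per_node):
    """Count rows of the dense matrix B that must be received.

    Assumed setup (illustrative only): A is n x n and given as (row, col)
    coordinates; GPU g owns rows [g * n/num_gpus, (g+1) * n/num_gpus) of
    both A and B; consecutive GPUs are grouped into nodes.
    """
    rows_per_gpu = n // num_gpus
    num_nodes = num_gpus // gpus_per_node

    # Sparsity-aware: GPU g needs row j of B only if its block of A has a
    # nonzero in column j and row j is owned by a different GPU.
    needed = [set() for _ in range(num_gpus)]
    for i, j in nonzeros:
        gpu = i // rows_per_gpu
        if j // rows_per_gpu != gpu:
            needed[gpu].add(j)

    # Sparsity-oblivious: every GPU receives every remote row of B.
    oblivious = num_gpus * (n - rows_per_gpu)
    sparsity_aware = sum(len(rows) for rows in needed)

    # Inter-node traffic. Flat: each GPU fetches its remote-node rows itself.
    # Hierarchical: a row crosses the inter-node link once per destination
    # node and is then shared over the faster intra-node links.
    flat_inter_node = 0
    node_needed = [set() for _ in range(num_nodes)]
    for gpu, rows in enumerate(needed):
        node = gpu // gpus_per_node
        for j in rows:
            owner_node = (j // rows_per_gpu) // gpus_per_node
            if owner_node != node:
                flat_inter_node += 1
                node_needed[node].add(j)
    hier_inter_node = sum(len(rows) for rows in node_needed)

    return {
        "oblivious": oblivious,
        "sparsity_aware": sparsity_aware,
        "flat_inter_node": flat_inter_node,
        "hier_inter_node": hier_inter_node,
    }


# Toy 8 x 8 matrix on 4 GPUs, 2 GPUs per node (made-up data).
toy_nonzeros = [(0, 0), (0, 5), (1, 2), (2, 5), (3, 3),
                (4, 1), (5, 4), (6, 1), (7, 6)]
volumes = communication_volumes(toy_nonzeros, n=8, num_gpus=4, gpus_per_node=2)
print(volumes)
```

In this toy case the sparsity-oblivious scheme moves 24 rows of $B$ in total, while the sparsity-aware one moves 5. Of those, 4 cross the inter-node link when every GPU fetches its own rows, but only 2 do when each row is sent once per destination node and shared locally. The numbers describe the made-up input only and say nothing about the speedups reported above.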