Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement several asynchronous, RDMA-based sparse-times-dense (SpMM) and sparse-times-sparse (SpGEMM) matrix multiplication algorithms and evaluate their performance on GPUs in a distributed-memory setting. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous, one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.