We propose an efficient, distributed, out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems, where the memory required to factorize a given matrix exceeds the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output (I/O) latency associated with batch copies between host and device is hidden by using CUDA streams to overlap data transfers and computation asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced by using optimized communicators based on the NVIDIA Collective Communications Library (NCCL). Benchmark results show a significant improvement, a 32x to 76x speedup, for the new GPU implementation over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340-terabyte matrix and an 11-exabyte sparse matrix of density 10e-6.
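The batching/tiling idea can be illustrated with a minimal sketch. It assumes the standard Frobenius-norm multiplicative updates and a row-wise tiling of the input matrix A, which is one common way to bound memory and not necessarily the exact scheme used in the implementation described here. The names `batched_nmf`, `frob_error`, and `load_batch` are hypothetical; `load_batch(b)` stands in for whatever fetches tile `b` (for example, a host-to-device copy), so that only one tile of A is resident at a time. Pure Python lists are used to keep the sketch dependency-free; a real implementation would use GPU array libraries.

```python
import random


def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def batched_nmf(load_batch, n_batches, n_cols, k, iters=100, eps=1e-9, seed=0):
    """Factorize A ~ W H while touching A one row tile at a time.

    load_batch(b) is a hypothetical callback returning row tile b of A.
    W is returned as a list of row blocks, one per tile.
    """
    rng = random.Random(seed)
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(k)]
    W = [None] * n_batches  # row blocks of W, created on first visit
    for _ in range(iters):
        Ht = transpose(H)
        HHt = matmul(H, Ht)                        # k x k, small
        WtA = [[0.0] * n_cols for _ in range(k)]   # accumulates W^T A
        WtW = [[0.0] * k for _ in range(k)]        # accumulates W^T W
        for b in range(n_batches):
            A_b = load_batch(b)  # only this tile of A is needed here
            if W[b] is None:
                W[b] = [[rng.random() + 0.1 for _ in range(k)] for _ in A_b]
            W_b = W[b]
            # Local update: W_b <- W_b * (A_b H^T) / (W_b H H^T)
            num = matmul(A_b, Ht)
            den = matmul(W_b, HHt)
            for i, row in enumerate(W_b):
                for j in range(k):
                    row[j] *= num[i][j] / (den[i][j] + eps)
            # Accumulate the small products needed for the H update.
            W_bt = transpose(W_b)
            for acc, part in ((WtA, matmul(W_bt, A_b)), (WtW, matmul(W_bt, W_b))):
                for acc_row, part_row in zip(acc, part):
                    for j, v in enumerate(part_row):
                        acc_row[j] += v
        # Global update: H <- H * (W^T A) / (W^T W H)
        den = matmul(WtW, H)
        H = [[H[i][j] * WtA[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(k)]
    return W, H


def frob_error(load_batch, n_batches, W, H):
    """Frobenius norm of A - W H, streamed over the same row tiles."""
    total = 0.0
    for b in range(n_batches):
        A_b = load_batch(b)
        R_b = matmul(W[b], H)
        total += sum((a - r) ** 2
                     for row_a, row_r in zip(A_b, R_b)
                     for a, r in zip(row_a, row_r))
    return total ** 0.5
```

In this sketch each iteration makes a single pass over the tiles: the W update is purely local to a tile, while the H update only needs the small accumulated products W^T A (k x n) and W^T W (k x k). In a distributed setting those accumulations are the natural place for a collective reduction, and `load_batch` is where transfers could be overlapped with computation.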