Optimizing Distributed ML Communication with Fused Computation-Collective Operations

To satisfy their ever-increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping communication and computation at kernel granularity is difficult because there is often no independent computation to overlap with. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels in which workgroups (WGs) communicate their results to remote GPUs as soon as they complete their computation. Meanwhile, other WGs within the same kernel continue computing, which overlaps with the communication and maintains high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformer, and MoE model architectures. To demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators and extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computation, reducing their combined execution time compared with current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, respectively, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations, respectively. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for a 128-node system.
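The overlap argument can be illustrated with a toy timing model. The sketch below is not the fused kernels described here, and it does not use any GPU or collective library. It only assumes that output tiles are computed in waves of `compute_slots` concurrent workgroups, that each tile takes `t_compute` to compute and `t_comm` to send, and that the link sends one tile at a time. All function names, parameters, and numbers are hypothetical and chosen for illustration only.

```python
import math


def kernel_granular_time(n_tiles, t_compute, t_comm, compute_slots):
    """Baseline (assumed model): compute every tile, then communicate all results."""
    waves = math.ceil(n_tiles / compute_slots)
    # Communication starts only after the whole compute kernel has finished.
    return waves * t_compute + n_tiles * t_comm


def fused_time(n_tiles, t_compute, t_comm, compute_slots):
    """Fused (assumed model): each wave sends its tiles as soon as it finishes."""
    waves = math.ceil(n_tiles / compute_slots)
    link_free_at = 0.0  # time at which the serial link becomes idle again
    for wave in range(waves):
        wave_done_at = (wave + 1) * t_compute
        tiles_in_wave = min(compute_slots, n_tiles - wave * compute_slots)
        # A wave's results go out once they exist and the link is free;
        # later waves keep computing in the meantime.
        send_start = max(wave_done_at, link_free_at)
        link_free_at = send_start + tiles_in_wave * t_comm
    return link_free_at


if __name__ == "__main__":
    # Hypothetical workload: 8 tiles, 2 concurrent workgroups,
    # 4 time units of compute and 1 time unit of communication per tile.
    baseline = kernel_granular_time(8, 4.0, 1.0, 2)
    fused = fused_time(8, 4.0, 1.0, 2)
    print(baseline, fused)  # 24.0 18.0
```

In this compute-bound setting, only the last wave's communication remains exposed, so the fused schedule finishes in 18 time units instead of 24. When communication dominates, the model shows a smaller gain because the link itself becomes the bottleneck. The reported speedups come from measurements and simulations, not from a model of this kind.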
