Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper, we describe the design, implementation, and evaluation of an MPI-based GPU communication API that enables easy-to-use, high-performance, CPU-free communication. The API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network interface card capabilities. We demonstrate its utility and performance by showing how it naturally enables CPU-free gather/scatter halo-exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium-message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs of the Frontier supercomputer.