Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or place significant synchronization burdens on programmers. In this paper we describe the design, implementation, and evaluation of an MPI-based GPU communication API that enables easy-to-use, high-performance, CPU-free communication. This API builds on previously proposed MPI extensions and leverages HPE Slingshot 11 network card capabilities. We demonstrate the utility and performance of the API by showing how it naturally enables CPU-free gather/scatter halo-exchange communication primitives in the Cabana/Kokkos performance portability framework, and through a performance comparison with Cray MPICH on the Frontier and Tuolumne supercomputers. Results from this evaluation show up to a 50% reduction in medium-message latency in simple GPU ping-pong exchanges and a 28% speedup when strong scaling a halo-exchange benchmark to 8,192 GPUs on the Frontier supercomputer.