In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their massive parallelism and high memory bandwidth. While GPUs accelerate computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU managed multi-GPU communication, but advances in GPU-centric communication now challenge this CPU dominance by reducing the CPU's involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU computation and communication. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, along with their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights on how to best exploit multi-GPU systems.