In recent years, GPUs have become the preferred accelerators for HPC and ML applications because of their parallelism and high memory bandwidth. While GPUs accelerate computation, inter-GPU communication can become a scalability bottleneck, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU has managed multi-GPU communication, but advances in GPU-centric communication now challenge this CPU dominance by reducing CPU involvement, granting GPUs more autonomy over communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper surveys the landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights on how to exploit multi-GPU systems to their full potential.