Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but support for it is lacking. On OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three, GPU kernels cannot autonomously drive distributed coordination: existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations. On InfiniBand, GPU-initiated communication is possible, but current implementations incur unnecessary synchronization and locking overheads. This paper presents GICC, a framework that enables GPU kernels to directly trigger NIC-level operations without host involvement on the fast path. In stencil computations, for example, GPU threads initiate halo exchanges as soon as boundary regions are computed, enabling fine-grained overlap between interior computation and boundary transfer. GICC decouples coordination semantics from data movement and introduces asynchronous resource reclamation: the NIC signals completion to both GPU and host memory, letting a lightweight host thread recycle NIC resources concurrently with GPU execution without injecting latency into the coordination path. This sustains GPU-driven coordination under finite NIC state, a capability absent from existing OFI-based runtimes. We implement GICC on NVIDIA and AMD GPUs over InfiniBand and Slingshot. On Slingshot, GICC reduces per-coordination latency by up to 229x and improves weak-scaling efficiency by up to 25%. On InfiniBand, it achieves up to 1.95x lower put latency than NVSHMEM by eliminating unnecessary locking and synchronization. On an industrial stencil proxy on 64 AMD MI250X GCDs, GPU-aware MPI incurs over 52% higher communication time than GICC, which achieves 42% parallel efficiency versus MPI's 35.4%.
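The asynchronous reclamation idea described above can be illustrated with a toy, host-only simulation. This is a minimal sketch under stated assumptions, not GICC's implementation or API: the names `run_coordination`, `nic_execute`, and `host_recycler` are hypothetical, the "GPU" is simply the calling thread, and the "NIC" is an ordinary function. It shows only the control flow: a fixed pool of pre-staged slots, a completion reported to two places, and a background thread that re-arms slots off the trigger path.

```python
import queue
import threading


def run_coordination(num_ops, num_slots):
    """Simulate num_ops triggered operations over num_slots pre-staged slots.

    Hypothetical model: returns the completions seen by the "GPU" side and
    the number of slots the host recycler re-armed.
    """
    armed = queue.Queue()             # slots staged and ready to be triggered
    host_completions = queue.Queue()  # completion records visible to the host
    gpu_completions = []              # completion records visible to the "GPU"
    rearm_count = 0

    # Pre-stage a finite number of slots (stands in for finite NIC state).
    for slot in range(num_slots):
        armed.put(slot)

    def nic_execute(slot, op):
        # Simulated NIC: report completion to both the GPU side and the host side.
        gpu_completions.append(op)
        host_completions.put(slot)

    def host_recycler():
        # Lightweight host thread: re-arm completed slots in the background.
        nonlocal rearm_count
        while True:
            slot = host_completions.get()
            if slot is None:  # shutdown sentinel
                return
            rearm_count += 1
            armed.put(slot)

    recycler = threading.Thread(target=host_recycler)
    recycler.start()

    # Simulated GPU side: trigger operations; blocks only if every slot is in flight.
    for op in range(num_ops):
        slot = armed.get()
        nic_execute(slot, op)

    host_completions.put(None)
    recycler.join()
    return gpu_completions, rearm_count


if __name__ == "__main__":
    done, rearmed = run_coordination(num_ops=100, num_slots=4)
    print(len(done), rearmed)
```

In this model the triggering loop never performs recycling itself; it only waits when all slots are simultaneously in flight, which is the sense in which a small, fixed pool can serve an unbounded sequence of operations.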