Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronization at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present AutoOverlap, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. AutoOverlap introduces a communication-chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, AutoOverlap applies transformations that align computation with chunk availability. Implemented as a source-to-source compiler on Triton, AutoOverlap delivers an average end-to-end speedup of 1.3$\times$ and up to 4.7$\times$ on multi-GPU workloads.