As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel~(BSP) model used in these settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the ``Three Taxes'' (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework, and we propose moving beyond the rigid BSP model to address them. By leveraging libraries such as Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by building direct, tile-level producer-consumer pipelines and by replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + General Matrix Multiplication (GEMM) operation to the more complex Flash Decode algorithm, we observe 10--20\% speedups in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
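To make the tile-level producer-consumer pattern concrete, the sketch below shows the idea in plain Triton. It is a hypothetical, minimal illustration under stated assumptions, not the paper's implementation and not the Iris API: the kernel names, the per-tile \texttt{flags} buffer, and the placeholder compute are invented for exposition, and the memory-ordering arguments on Triton atomics assume a recent Triton release. A producer kernel writes one tile and raises that tile's flag; the consumer kernel spin-waits on the same flag before reading the tile, so no device-wide barrier or extra kernel-launch boundary separates the two stages. In the multi-GPU setting described above, the data and flag buffers would live in symmetric, peer-accessible memory of the kind a library like Iris exposes.

\begin{verbatim}
# Hypothetical sketch: per-tile flag synchronization in place of a global
# barrier between a producer and a consumer kernel. Names are illustrative.
import triton
import triton.language as tl


@triton.jit
def producer_tile(src_ptr, buf_ptr, flags_ptr, n_elements,
                  BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(src_ptr + offs, mask=mask)
    tl.store(buf_ptr + offs, x + 1.0, mask=mask)   # produce one tile
    # Release-store this tile's flag so the consumer observes the tile data.
    tl.atomic_xchg(flags_ptr + pid, 1, sem="release")


@triton.jit
def consumer_tile(buf_ptr, out_ptr, flags_ptr, n_elements,
                  BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # Fine-grained dataflow synchronization: wait only for this tile's flag,
    # not for the entire producer grid to finish (the bulk-synchronous tax).
    ready = tl.atomic_add(flags_ptr + pid, 0, sem="acquire")
    while ready == 0:
        ready = tl.atomic_add(flags_ptr + pid, 0, sem="acquire")
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(buf_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)   # consume the tile
\end{verbatim}

Here \texttt{flags\_ptr} points to a zero-initialized \texttt{int32} buffer with one entry per tile; the consumer of tile \texttt{pid} can start as soon as its producer has finished that single tile, which is precisely what removes the global-barrier wait between dependent kernels.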