TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

arXiv preprint, 14 pages, 16 figures. Source code: https://github.com/microsoft/tokenweave. In version 2, Figure 6 shows All-Reduce bandwidth instead of Reduce-Scatter. The Multimem Reduce-Scatter bandwidth formula differs slightly from the ring-based version. The x-ticks in Figure 7 have been fixed.

Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even on GPUs connected via high-speed interconnects such as NVLink. Several techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, decomposing a large GPU computation into many smaller ones incurs overhead of its own. Furthermore, the communication itself occupies many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel that carefully leverages the Multimem instruction support available on NVIDIA Hopper and Blackwell GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other subset's computation, providing additional gains. Our evaluations demonstrate up to 1.29x latency speedup and up to 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave outperforms an equivalent model with all communication removed.
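The Token-Splitting step can be illustrated with a minimal sketch. The abstract does not give the actual splitting rule, so this example assumes that "wave-aware" means placing the split point on a tile boundary, so that the first subset fills whole tiles rather than leaving a partially filled wave. The function name wave_aware_split, the parameter tile_tokens, and its default of 128 are hypothetical and are not taken from the TokenWeave source code.

```python
def wave_aware_split(num_tokens: int, tile_tokens: int = 128) -> tuple[int, int]:
    """Split a batch of num_tokens into two near-equal subsets.

    Assumption (not from the paper): the first subset is rounded to the
    nearest multiple of tile_tokens, a hypothetical tile size along the
    token dimension, so that it occupies whole tiles.
    Returns (first_subset_size, second_subset_size).
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tile_tokens <= 0:
        raise ValueError("tile_tokens must be positive")

    # Round num_tokens / 2 to the nearest multiple of tile_tokens
    # using integer arithmetic only.
    half = num_tokens // 2
    first = ((half + tile_tokens // 2) // tile_tokens) * tile_tokens

    # Never assign more tokens than the batch contains.
    first = min(first, num_tokens)
    return first, num_tokens - first


if __name__ == "__main__":
    for n in (100, 1000, 1024, 4096):
        print(n, wave_aware_split(n))
```

In this sketch a small batch can end up with one empty subset, meaning it is simply left unsplit. With two non-empty subsets, the communication for one can be issued while the other is still computing, which is the overlap the abstract describes.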
