The rapid growth of large language models has made it necessary to partition computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial communication overhead, which significantly hinders computational efficiency. While communication-computation overlap is a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, we introduce a communication-computation overlap technique that eliminates the tail latency of state-of-the-art overlap methods for distributed LLM training. The technique aims to mitigate the communication bottleneck of tensor parallelism and data parallelism in distributed training and inference. In particular, we propose CommFuse, a method that replaces the conventional reduce-scatter and all-gather collective operations with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method is an exact algorithm that reduces communication overhead and eliminates tail latency. Moreover, it is a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, higher Model FLOPS Utilization (MFU), and high throughput.
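To make the idea of replacing an all-gather with decomposed P2P communication more concrete, the following is a minimal single-process sketch. It is not the CommFuse algorithm, whose scheduling details the abstract does not give. It assumes a generic ring decomposition in which each rank multiplies the chunk it currently holds while that chunk is forwarded to its neighbor. The function names `matmul`, `allgather_then_matmul`, and `ring_p2p_overlapped_matmul` are hypothetical, ranks are simulated as list indices, and no real overlap or network transfer takes place. The sketch only checks that the decomposed schedule is exact, meaning it produces the same result as gathering first and computing afterwards.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def allgather_then_matmul(shards, weights):
    """Baseline: every rank waits for the full all-gather, then multiplies once."""
    full = [row for shard in shards for row in shard]
    return [matmul(full, w) for w in weights]


def ring_p2p_overlapped_matmul(shards, weights):
    """Assumed ring schedule: one P2P hop per step, compute on the chunk in hand."""
    world = len(shards)
    held = list(shards)  # held[r] is the chunk currently resident on rank r
    out = [[None] * world for _ in range(world)]  # out[r][src] is one row block
    for step in range(world):
        for rank in range(world):
            src = (rank - step) % world  # original owner of the chunk on this rank
            # On real hardware this product would run while the next hop is in flight.
            out[rank][src] = matmul(held[rank], weights[rank])
        if step < world - 1:
            # Simulated P2P send/recv: rank r forwards its chunk to rank r + 1.
            held = [held[(rank - 1) % world] for rank in range(world)]
    # Stitch the row blocks back together in source-rank order.
    return [[row for block in out[rank] for row in block] for rank in range(world)]


# Three simulated ranks, each owning a 2x2 activation shard and a 2x2 weight.
shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]], [[0, 1], [1, 0]]]
assert ring_p2p_overlapped_matmul(shards, weights) == allgather_then_matmul(shards, weights)
```

In this schedule every step pairs one chunk-sized product with one chunk-sized transfer, so on real hardware only the first chunk's arrival could not be hidden behind computation. How CommFuse partitions and orders the work to remove tail latency is specific to the paper and is not reproduced here.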