Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. Such large models typically require distributed training and inference. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices, to overcome the memory capacity limitation of a single processor and/or to accelerate computation enough to meet a given latency requirement. However, this form of parallelism introduces additional communication that can account for a significant portion of overall runtime, which limits the scalability of the technique even within a group of devices connected by high-speed interconnects, such as GPUs linked by NVLink within a node. This paper proposes Flux, a novel method that significantly hides communication latencies behind dependent computations on GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel, effectively hiding communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it achieves up to 1.24x speedups over Megatron-LM for training on a cluster of 128 GPUs, and up to 1.66x and 1.30x speedups over vLLM for prefill and decoding inference, respectively, on a cluster of 8 GPUs, across various GPU generations and interconnects.
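To make the over-decomposition idea concrete, the following is a minimal, hedged sketch (not Flux's actual fused CUDA kernel): it simulates a tensor-parallel GEMM whose per-device partial products must be summed (the communication step), and over-decomposes the output into row tiles so that each tile's reduction could, in a real implementation, be launched asynchronously and overlapped with the next tile's compute. All function names and the tiling scheme here are illustrative assumptions.

```python
def matmul(A, B):
    """Plain dense matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def overlapped_tp_gemm(A_shards, B_shards, n_tiles=2):
    """Simulate a tensor-parallel GEMM with over-decomposed reduction.

    A is column-sharded and B is row-sharded across "devices"; each
    device's partial product A_shards[d] @ B_shards[d] must be summed,
    which models the reduce communication in tensor parallelism.
    """
    M = len(A_shards[0])
    N = len(B_shards[0][0])
    out = [[0.0] * N for _ in range(M)]
    tile = M // n_tiles
    for t in range(n_tiles):                 # finer-grained output tiles
        rows = slice(t * tile, (t + 1) * tile)
        for d in range(len(A_shards)):       # per-device partial GEMM
            part = matmul(A_shards[d][rows], B_shards[d])
            # In Flux, this accumulation (the reduce) would be issued
            # asynchronously inside the fused kernel, hiding it behind
            # the computation of the next tile; here it runs serially.
            for i, row in enumerate(part):
                for j, v in enumerate(row):
                    out[t * tile + i][j] += v
    return out
```

Because each tile's partial results are reduced as soon as they are produced, rather than after the entire GEMM finishes, the communication for tile `t` can proceed while tile `t+1` is being computed, which is the latency-hiding structure the abstract describes.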