The ever-increasing size of large language models necessitates distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method boosts intra-node communication speed by more than 3x and halves the time-to-first-token, with almost no loss in model accuracy. Extensive experiments on a range of recent LLMs demonstrate the effectiveness of our approach.
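To make the idea of low-bit communication compression concrete, the sketch below shows generic per-group asymmetric int4 quantization of an activation shard before it would be sent over the interconnect. This is an illustrative assumption, not the paper's exact scheme: the function names (`quantize_int4`, `dequantize_int4`), the group size of 128, and the mock activation tensor are all hypothetical, and real systems would pack two 4-bit values per byte and fuse these steps into the collective kernel.

```python
import numpy as np

def quantize_int4(x, group_size=128):
    """Per-group asymmetric int4 quantization (illustrative, not the
    paper's method): each group of `group_size` values is mapped to
    integers in [0, 15] plus a per-group scale and zero-point."""
    x = x.reshape(-1, group_size)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale[scale == 0] = 1.0  # avoid division by zero on constant groups
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo):
    """Reconstruct an approximate float tensor from int4 codes."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
act = rng.standard_normal(4096).astype(np.float32)  # mock activation shard
q, scale, lo = quantize_int4(act)
rec = dequantize_int4(q, scale, lo).reshape(-1)

# Payload shrinks roughly 8x versus fp32 (4 bits vs 32 bits per value,
# ignoring the small per-group scale/zero-point overhead), which is the
# kind of reduction that relieves a bandwidth-bound collective.
print(f"max abs reconstruction error: {np.abs(rec - act).max():.4f}")
```

In a tensor-parallel setting, each rank would quantize its partial results this way, exchange the compact int4 payload in the all-reduce/all-gather, and dequantize before accumulation; the accuracy question is whether the reconstruction error stays small enough to leave model outputs intact.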