The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce \emph{Flash Communication}, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method boosts intra-node communication speed by more than 3x and halves the \emph{time-to-first-token}, with almost no loss in model accuracy. Extensive experiments on a range of recent LLMs demonstrate the effectiveness of our approach.
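The abstract does not specify the exact compression scheme, but the core idea of low-bit communication compression can be illustrated with a minimal sketch: quantize activations to 4 bits per value (with a per-group scale and zero-point) before transmitting them in an all-reduce, then dequantize on receipt. The function names, the group size of 128, and the asymmetric min/max scheme below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def quantize_int4(x, group_size=128):
    """Illustrative per-group asymmetric 4-bit quantization.

    Each contiguous group of `group_size` values shares one scale and
    one zero-point, so the payload shrinks roughly 8x versus fp32
    (ignoring the small per-group metadata).
    """
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0  # 4-bit codes span 0..15
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo, shape):
    """Reconstruct an approximate fp32 tensor from the 4-bit codes."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Simulate compressing one rank's activation shard before an all-reduce.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
q, scale, lo = quantize_int4(x)
x_hat = dequantize_int4(q, scale, lo, x.shape)
max_err = float(np.abs(x - x_hat).max())
```

In a real tensor-parallel setup the `q`, `scale`, and `lo` buffers would be what travels over NCCL, trading a small, bounded quantization error (`max_err` here) for several-fold less traffic on the bandwidth-limited link.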