Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping and effectively hides the latency of communication. Our insight is that in addition to systems optimizations, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual allows communication-computation decoupling under conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 30% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting model as the Ladder Transformer. We train 1B- and 3B-parameter Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that parts of the Llama-3.1 8B model can be converted to the Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
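The communication-computation decoupling described above can be illustrated with a simple timing simulation. The sketch below is not the paper's implementation: it uses sleep-based stand-ins for a block's sharded compute and its tensor-parallel all-reduce (the timings, function names, and the one-stage-delayed residual fold are illustrative assumptions). It contrasts the standard residual stream, where each block must wait on the all-reduce of its own output, with a Ladder-style stream that starts the next block's compute while the previous all-reduce runs in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative timings (seconds) for one block's shard compute and its
# tensor-parallel all-reduce; names and numbers are assumptions.
COMPUTE_S = 0.02
COMM_S = 0.01
N_BLOCKS = 4

def compute(x):
    time.sleep(COMPUTE_S)   # simulate sharded matmul work
    return x + 1

def all_reduce(x):
    time.sleep(COMM_S)      # simulate cross-GPU communication
    return x

def sequential_blocks(x):
    # Standard residual stream: the next block cannot start until the
    # current block's all-reduce has completed.
    for _ in range(N_BLOCKS):
        x = x + all_reduce(compute(x))
    return x

def ladder_blocks(x):
    # Ladder-style stream: block i+1's compute overlaps with block i's
    # all-reduce; the reduced output is folded back one stage later.
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(N_BLOCKS):
            y = compute(x)                # overlaps with pending comm
            if pending is not None:
                x = x + pending.result()  # fold in previous reduce
            pending = pool.submit(all_reduce, y)
        x = x + pending.result()          # drain the final all-reduce
    return x

t0 = time.perf_counter(); sequential_blocks(0.0); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); ladder_blocks(0.0); t_lad = time.perf_counter() - t0
print(f"sequential: {t_seq:.3f}s  ladder-style: {t_lad:.3f}s")
```

Note that the two functions compute different values: routing the residual around the pending all-reduce is exactly the architectural change, which is why Ladder Residual requires training (or brief retraining) rather than being a drop-in systems optimization.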