Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping and effectively hides the latency of communication. Our insight is that in addition to systems optimizations, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual allows communication-computation decoupling under conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 30% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting model as the Ladder Transformer. We train 1B- and 3B-parameter Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that parts of the Llama-3.1 8B model can be converted to the Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
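The communication-computation decoupling described above can be illustrated with a simple timing simulation. The sketch below is not the paper's implementation: it uses sleep-based stand-ins for a block's sharded compute and its tensor-parallel all-reduce (the timings, function names, and the one-stage-delayed residual fold are illustrative assumptions). It contrasts the standard residual stream, where each block must wait on the all-reduce of its own output, with a Ladder-style stream that starts the next block's compute while the previous all-reduce runs in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative timings (seconds) for one block's shard compute and its
# tensor-parallel all-reduce; names and numbers are assumptions.
COMPUTE_S = 0.02
COMM_S = 0.01
N_BLOCKS = 4

def compute(x):
    time.sleep(COMPUTE_S)   # simulate sharded matmul work
    return x + 1

def all_reduce(x):
    time.sleep(COMM_S)      # simulate cross-GPU communication
    return x

def sequential_blocks(x):
    # Standard residual stream: the next block cannot start until the
    # current block's all-reduce has completed.
    for _ in range(N_BLOCKS):
        x = x + all_reduce(compute(x))
    return x

def ladder_blocks(x):
    # Ladder-style stream: block i+1's compute overlaps with block i's
    # all-reduce; the reduced output is folded back one stage later.
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(N_BLOCKS):
            y = compute(x)                # overlaps with pending comm
            if pending is not None:
                x = x + pending.result()  # fold in previous reduce
            pending = pool.submit(all_reduce, y)
        x = x + pending.result()          # drain the final all-reduce
    return x

t0 = time.perf_counter(); sequential_blocks(0.0); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); ladder_blocks(0.0); t_lad = time.perf_counter() - t0
print(f"sequential: {t_seq:.3f}s  ladder-style: {t_lad:.3f}s")
```

Note that the two functions compute different values: routing the residual around the pending all-reduce is exactly the architectural change, which is why Ladder Residual requires training (or brief retraining) rather than being a drop-in systems optimization.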