With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as tensor parallelism pose a significant challenge to achieving scalability and low latency. To address this, we introduce a novel optimization technique, Sync-Point Drop (SPD), which reduces communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD achieved about a 20% reduction in overall inference latency with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
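The idea of dropping the synchronization point can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it simulates 2-way tensor parallelism of an attention-like pair of projections (column-sharded input projection, row-sharded output projection), where standard tensor parallelism all-reduces the partial outputs after the output projection, while SPD lets each shard proceed with its partial sum. All names (`tp_attention`, `W_qkv`, `W_o`, `drop_sync`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_dev = 8, 2

x = rng.normal(size=(4, d))       # token activations (batch of 4 tokens)
W_qkv = rng.normal(size=(d, d))   # stand-in for the attention input projection
W_o = rng.normal(size=(d, d))     # attention output projection

# Tensor parallelism: column-shard W_qkv and row-shard W_o across devices,
# so each device holds one matched pair of shards.
W_qkv_shards = np.split(W_qkv, n_dev, axis=1)
W_o_shards = np.split(W_o, n_dev, axis=0)

def tp_attention(x, drop_sync):
    # Each simulated device computes a partial attention output.
    partials = [(x @ wq) @ wo for wq, wo in zip(W_qkv_shards, W_o_shards)]
    if drop_sync:
        # SPD: skip the all-reduce; each device continues with its partial sum.
        return partials
    # Standard TP: all-reduce so every device holds the full output.
    return [sum(partials)] * n_dev

# With the sync point kept, TP reproduces the single-device result exactly.
ref = (x @ W_qkv) @ W_o
assert np.allclose(tp_attention(x, drop_sync=False)[0], ref)
```

Because the partial sums still add up to the full output (the all-reduce is a plain sum), dropping the sync trades exactness of each device's intermediate activations for one fewer communication round per attention block, which is why the paper applies it selectively based on block sensitivity.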