Large Language Models (LLMs) have pushed the frontier of artificial intelligence, but they comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed across multiple hardware accelerators using various model-parallelism strategies. This paper examines one such strategy, Tensor Parallelism, and proposes reducing latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5x to 4.5x. Our proposed method reduces time-to-first-token (TTFT) by up to 2x with negligible degradation in model performance.
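To make the compression ratio concrete, below is a minimal sketch of group-wise (fine-grained) symmetric quantization of an activation vector, where each small group of values shares one scale. This is an illustrative assumption, not the paper's actual method: the group size (128), bit width (4), and function names are hypothetical. Note that with 128-element groups, 4-bit codes, and one 16-bit scale per group, the payload shrinks by (128×16)/(128×4+16) ≈ 3.9x relative to 16-bit activations, which falls inside the 3.5–4.5x range the abstract reports.

```python
import numpy as np

def quantize_groupwise(x, group_size=128, bits=4):
    """Symmetric group-wise quantization: each group gets its own scale.

    Hypothetical sketch; group_size/bits are illustrative choices.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. int4 symmetric range [-8, 7]
    groups = x.reshape(-1, group_size)
    # One scale per group, derived from that group's absolute maximum.
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales):
    """Reconstruct approximate activations from codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)
```

Smaller groups track local activation ranges more tightly (lower quantization error) but spend more bits on scales, so the group size trades accuracy against compression ratio.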