Large Language Models (LLMs) have pushed the frontier of artificial intelligence, but they comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed across multiple hardware accelerators using various Model Parallelism strategies. This paper examines one such strategy, Tensor Parallelism, and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method yields up to a 2x reduction in time-to-first-token (TTFT) with negligible degradation in model performance.
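To make the 3.5-4.5x figure concrete, the following is a minimal sketch of fine-grained (group-wise) symmetric int4 quantization of an activation tensor, the general family of technique the abstract refers to. The group size and formats here are illustrative assumptions, not the paper's exact scheme: with one fp16 scale per 128-element group, the storage cost drops from 16 bits per element to 4 bits plus amortized scale overhead, landing inside the quoted compression range.

```python
import numpy as np

# Assumed group size; one shared scale per group (not the paper's exact choice).
GROUP = 128

def quantize_groups(x: np.ndarray):
    """Symmetric int4 quantization with one fp16 scale per group."""
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 symmetric range [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)            # avoid divide-by-zero
    # Stored in int8 containers here for simplicity; conceptually 4 bits each.
    q = np.clip(np.rint(g / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_groups(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

x = np.random.randn(4096).astype(np.float32)
q, s = quantize_groups(x)
x_hat = dequantize_groups(q, s)

# Compression vs fp16: 16 bits/elem -> 4 bits/elem + one fp16 scale per group.
ratio = (GROUP * 16) / (GROUP * 4 + 16)
print(f"compression ratio: {ratio:.2f}x")  # -> 3.88x, within the quoted 3.5-4.5x
```

In a Tensor Parallel setting, a compressor like this would be applied to the selected activations immediately before an inter-accelerator collective (e.g. an all-reduce or all-gather) and inverted on receipt, trading a small quantization error for a several-fold reduction in bytes on the wire.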