Tensor parallelism is an effective way to improve the inference efficiency of server-side large language models (LLMs), despite introducing additional communication cost. However, as server LLMs continue to scale, they must be distributed across more devices, magnifying this communication cost. One way to approach the problem is quantization, but current LLM quantization methods tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in these communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains approximately 98.0% of Gemma 2 27B's and 99.5% of Llama 2 13B's original performance, averaged across all evaluated tasks.
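To illustrate the general idea of outlier-aware quantization (the abstract does not specify the exact scheme), the following is a minimal sketch: values are quantized per-tensor to 4-bit integers, except for a small fraction of largest-magnitude "outliers" that pass through at full precision. The function name, the `outlier_frac` parameter, and the average-bit-width accounting are all hypothetical choices for this illustration, not the paper's method.

```python
import numpy as np

def outlier_aware_quantize(x, bits=4, outlier_frac=0.01):
    """Hypothetical sketch: symmetric per-tensor quantization to `bits`
    bits, keeping the top `outlier_frac` fraction of values (by
    magnitude) in 16-bit full precision."""
    x = np.asarray(x, dtype=np.float32)
    flat_abs = np.abs(x).ravel()
    k = max(1, int(outlier_frac * flat_abs.size))
    # Mark the k largest-magnitude entries as outliers.
    outlier_idx = np.argpartition(flat_abs, -k)[-k:]
    mask = np.zeros(flat_abs.size, dtype=bool)
    mask[outlier_idx] = True
    mask = mask.reshape(x.shape)

    # Quantize only the inliers; scale so the largest inlier maps to
    # the top of the signed `bits`-bit range.
    inliers = np.where(mask, 0.0, x)
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(inliers).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(inliers / scale), -(qmax + 1), qmax)
    dequant = q * scale

    # Outliers bypass quantization entirely.
    out = np.where(mask, x, dequant)
    # Approximate average payload: inliers cost `bits`, outliers 16.
    avg_bits = bits + outlier_frac * (16 - bits)
    return out, avg_bits
```

Under this accounting, sparing roughly 1–2% of values as 16-bit outliers yields an average in the low-4-bit range, which is consistent in spirit with the 4.2-bit figure reported above (the exact fraction and per-value costs in the paper may differ).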