Large-scale language models (LLMs) have demonstrated outstanding performance on various tasks, but their enormous model size makes them challenging to deploy. In this paper, we identify that the main challenge in quantizing LLMs stems from the differing activation ranges across channels, rather than from outliers alone. We propose a novel reorder-based quantization approach, RPTQ, that addresses the problem of quantizing LLM activations. RPTQ rearranges the channels in the activations and then quantizes them in clusters, thereby reducing the impact of range differences between channels. In addition, we reduce storage and computation overhead by avoiding explicit reordering. With this approach, we push LLM activation quantization to 3 bits for the first time.
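The reorder-then-cluster idea can be illustrated with a minimal sketch. This is not the RPTQ implementation: the function names (`quantize_group`, `cluster_channels`, `quantize_by_cluster`, `mse`) are hypothetical, the clustering is a simple sort by channel range split into equal-sized groups (an assumption standing in for whatever clustering the method actually uses), and the activations are synthetic. Channels are addressed through index lists instead of a permuted copy, which is only a loose nod to avoiding explicit reordering.

```python
import random


def quantize_group(values, bits):
    """Asymmetric uniform fake-quantization with one shared range for all values."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def cluster_channels(x, num_clusters):
    """Group channel indices with similar ranges (assumed: sort by range, equal split)."""
    columns = list(zip(*x))  # x is a list of token rows; columns are channels
    spans = [max(col) - min(col) for col in columns]
    order = sorted(range(len(spans)), key=spans.__getitem__)
    size = -(-len(order) // num_clusters)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


def quantize_by_cluster(x, clusters, bits):
    """Quantize each cluster of channels with its own quantization range."""
    out = [list(row) for row in x]
    for cluster in clusters:
        flat = [row[c] for row in x for c in cluster]
        quantized = iter(quantize_group(flat, bits))
        for row in out:  # write back in the same (row, channel) order
            for c in cluster:
                row[c] = next(quantized)
    return out


def mse(a, b):
    """Mean squared error between two equally shaped lists of rows."""
    n = sum(len(row) for row in a)
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


# Synthetic activations: 64 tokens, 16 channels with very different ranges.
random.seed(0)
scales = [0.01, 0.1, 1.0, 10.0]
x = [[random.uniform(-scales[c % 4], scales[c % 4]) for c in range(16)]
     for _ in range(64)]

# Baseline: one quantization range shared by every channel.
per_tensor = quantize_by_cluster(x, [list(range(16))], bits=3)

# Clustered: channels with similar ranges share a quantization range.
clusters = cluster_channels(x, num_clusters=4)
clustered = quantize_by_cluster(x, clusters, bits=3)

print("per-tensor 3-bit MSE:", mse(x, per_tensor))
print("clustered  3-bit MSE:", mse(x, clustered))
```

On this toy data, a single shared range is dominated by the widest channels, so narrow channels collapse onto one or two quantization levels; giving each cluster its own range keeps the step size matched to the channels it covers.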