Large language models (LLMs) have demonstrated impressive performance, but their large memory footprint makes deployment challenging. Quantization can alleviate this problem. In this paper, we identify that the difficulty of quantizing activations in LLMs arises from the differing value ranges across channels, not solely from the presence of outliers. To address this, we propose RPTQ, a reorder-based quantization method. By rearranging the channels and quantizing them in clusters, RPTQ mitigates the impact of range differences between channels. To minimize the overhead of the reorder operation, we fuse it into the layer norm operation and into the weights of the linear layers. In our experiments, RPTQ quantizes LLM activations to 3 bits for the first time, substantially reducing memory usage. For instance, quantizing OPT-175b can reduce memory consumption by up to 80%.
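To make the reorder-then-cluster idea concrete, the following NumPy sketch quantizes a synthetic activation matrix to 3 bits in two ways: with one set of quantization parameters for the whole tensor, and with one set per cluster of reordered channels. It is a minimal illustration, not the paper's implementation. The function names, the synthetic data, the choice of 8 clusters, and the sort-by-range grouping (used here in place of a proper clustering algorithm) are all assumptions, and the quantization parameters are computed from the input itself rather than from calibration data.

```python
import numpy as np


def fake_quantize(x, n_bits):
    """Asymmetric uniform quantize-then-dequantize with one (scale, offset) for all of x."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo


def reorder_permutation(acts, n_clusters):
    """Return a channel permutation and index groups over the reordered channels.

    Assumption: channels are sorted by (max - min) and split into contiguous,
    roughly equal groups. This stands in for a real clustering step.
    """
    ranges = acts.max(axis=0) - acts.min(axis=0)
    perm = np.argsort(ranges)
    groups = np.array_split(np.arange(acts.shape[1]), n_clusters)
    return perm, groups


def clustered_fake_quantize(x, perm, groups, n_bits):
    """Reorder channels, then quantize each cluster with its own parameters."""
    x_reordered = x[:, perm]
    out = np.empty_like(x_reordered)
    for g in groups:
        out[:, g] = fake_quantize(x_reordered[:, g], n_bits)
    return out  # channels remain in reordered order


# Synthetic activations whose channels span very different ranges (hypothetical data).
rng = np.random.default_rng(0)
n_tokens, n_channels = 256, 64
channel_scale = np.logspace(-1, 1.5, n_channels)
rng.shuffle(channel_scale)
x = rng.normal(size=(n_tokens, n_channels)) * channel_scale
w = rng.normal(size=(n_channels, 16))  # weights of a following linear layer

perm, groups = reorder_permutation(x, n_clusters=8)
x_q_tensor = fake_quantize(x, n_bits=3)
x_q_cluster = clustered_fake_quantize(x, perm, groups, n_bits=3)

err_tensor = np.mean((x - x_q_tensor) ** 2)
err_cluster = np.mean((x[:, perm] - x_q_cluster) ** 2)
print(f"per-tensor MSE:  {err_tensor:.4f}")
print(f"per-cluster MSE: {err_cluster:.4f}")

# Permuting the rows of W with the same permutation leaves the layer output
# unchanged, which is why the reorder can be absorbed into the weights.
y_ref = x @ w
y_reordered = x[:, perm] @ w[perm, :]
print("output preserved:", np.allclose(y_ref, y_reordered))
```

In this toy setting the per-cluster error is lower because channels with small ranges no longer share a quantization step with the widest channels. The last check only shows that a channel permutation can be absorbed into the next layer's weights; the explicit indexing `x[:, perm]` is the runtime cost that fusing the reorder into the layer norm and the weights is meant to avoid.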