Large language models~(LLMs) have recently demonstrated promising performance on many tasks. However, their high storage and computational cost has become a challenge for deployment. Weight quantization is a widely used model compression technique that reduces both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook, which incurs substantial accuracy loss at high compression ratios. In this paper, we propose a novel weight quantization method for LLMs, called low-rank codebook based quantization~(LCQ). LCQ adopts a low-rank codebook, whose rank can be larger than one, for quantization. Experiments show that LCQ achieves better accuracy than existing methods with negligible extra storage cost.
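To make the idea of a low-rank codebook concrete, the following is a minimal, dependency-free sketch. It assumes (as an illustration only, not the paper's actual algorithm) that the codebook is parameterized as the product of two factors U (codewords-by-rank) and V (rank-by-columns), so that each weight-matrix column gets its own set of codewords, and each weight is rounded to its nearest codeword. With rank r = 1 this degenerates to the usual scale-times-levels scheme; r > 1 yields a richer codebook at small extra storage cost.

```python
def quantize_with_lowrank_codebook(W, U, V):
    """Quantize each entry of W to its nearest codeword.

    The codebook is C = U @ V, i.e. C[c][j] = sum_t U[c][t] * V[t][j];
    column j of C holds the k candidate codewords for column j of W.
    A rank-one codebook (r = 1) recovers ordinary scalar quantization;
    rank r > 1 gives each column a more expressive codeword set.
    """
    k, r = len(U), len(U[0])
    n = len(V[0])
    # Materialize the codebook C (k codewords per column of W).
    C = [[sum(U[c][t] * V[t][j] for t in range(r)) for j in range(n)]
         for c in range(k)]
    # Round every weight to the nearest codeword in its column's codebook.
    return [[min((C[c][j] for c in range(k)), key=lambda v: abs(w - v))
             for j, w in enumerate(row)]
            for row in W]


# Toy example: 3 codewords, rank-2 codebook, a single 1x2 weight row.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, 1.0], [0.25, -1.0]]
W = [[0.6, 0.1]]
print(quantize_with_lowrank_codebook(W, U, V))  # -> [[0.5, 0.0]]
```

In a real deployment only the factors U and V and the per-weight codeword indices would be stored, which is why the extra cost of raising the rank above one is small.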