We consider the problem of model compression for Large Language Models (LLMs) at post-training time, where the task is to compress a well-trained model using only a small set of calibration input data. In this work, we introduce a new low-rank approach to correct for quantization errors of \emph{activations} in LLMs: we propose to add low-rank weight matrices in full precision that act on the \emph{unquantized} activations. We then solve a joint optimization problem over the quantized representation of the weights and additional low-rank weight matrices to quantize both weights and activations. We focus on the case of 4-bit weight-and-activation quantization (W4A4). Using ranks equivalent to 10\% of the original weight matrix size, our approach reduces the accuracy gap with the original model by more than 50\%. Using ranks equivalent to 30\% of the original weight matrix, the accuracy gap is closed completely. We demonstrate our results on four recent LLMs, namely Llama-2, Llama-3, Phi-3 and Mixtral models.
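The forward pass the abstract describes can be sketched as follows: a fake-quantized weight matrix applied to fake-quantized activations, plus a full-precision low-rank term applied to the unquantized activations. This is a minimal illustrative sketch, not the paper's implementation: the symmetric per-tensor quantizer and the randomly initialized factors `A`, `B` are stand-ins for the jointly optimized quantized weights and low-rank matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(x, bits=4):
    # Symmetric per-tensor fake quantization (illustrative only;
    # the paper's actual quantization scheme may differ).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d_out, d_in = 64, 64
r = int(0.1 * d_in)  # rank equal to 10% of the weight dimension

W = rng.standard_normal((d_out, d_in))  # pretrained weight (full precision)
x = rng.standard_normal(d_in)           # input activation

# Hypothetical low-rank correction factors kept in full precision;
# in the method these are optimized jointly with the quantized weights.
A = 0.1 * rng.standard_normal((d_out, r))
B = 0.1 * rng.standard_normal((r, d_in))

# W4A4 path plus the low-rank term acting on the *unquantized* activation.
y = fake_quant(W) @ fake_quant(x) + A @ (B @ x)
```

Note that the low-rank branch `A @ (B @ x)` costs two skinny matrix products, which is why a rank of 10–30% of the weight size remains cheap relative to the full-precision layer.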