Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of the quantization error towards a desirable distribution, which enables nearly lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer.
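The combination of quantization and low-rank error approximation described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the helper names `quantize_rtn` and `lqer_factors` are hypothetical, the quantizer is a plain symmetric round-to-nearest scheme swapped in for illustration, activations are kept in floating point rather than 8-bit, and the diagonal scale is assumed to be the mean absolute activation per input channel on a small calibration batch.

```python
import numpy as np


def quantize_rtn(w, bits=4):
    """Symmetric per-column round-to-nearest quantization (returned dequantized).

    Illustrative stand-in for the low-precision weight format.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero columns
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def lqer_factors(w, w_q, rank, act_scale=None):
    """Return (A_k, B_k) such that A_k @ B_k approximates the error w - w_q.

    act_scale is an assumed positive per-input-channel vector s (S = diag(s)).
    The SVD is taken of S @ E_q, and S^-1 is folded back into A_k.
    """
    err = w - w_q
    if act_scale is None:
        act_scale = np.ones(w.shape[0])  # plain, unscaled low-rank correction
    u, s, vt = np.linalg.svd(act_scale[:, None] * err, full_matrices=False)
    a_k = u[:, :rank] / act_scale[:, None]  # S^-1 @ U_k
    b_k = s[:rank, None] * vt[:rank]        # Sigma_k @ V_k^T
    return a_k, b_k


rng = np.random.default_rng(0)
# Calibration activations with uneven channel magnitudes (synthetic data).
x = rng.normal(size=(32, 64)) * rng.uniform(0.1, 5.0, size=64)
w = rng.normal(size=(64, 48))

w_q = quantize_rtn(w, bits=4)
act_scale = np.abs(x).mean(axis=0)  # assumption: mean |activation| per channel
a_k, b_k = lqer_factors(w, w_q, rank=8, act_scale=act_scale)

y_ref = x @ w
y_q = x @ w_q                          # quantized weights only
y_lqer = x @ w_q + (x @ a_k) @ b_k     # two regular, dense matmul paths

err_q = np.linalg.norm(y_ref - y_q)
err_lqer = np.linalg.norm(y_ref - y_lqer)
print(f"output error, quantized only: {err_q:.4f}")
print(f"output error, with low-rank correction: {err_lqer:.4f}")
```

In this sketch the correction is computed as two dense matrix products alongside the quantized one, so no high-precision weights have to be gathered from irregular memory locations. Scaling the error before the SVD weights the rows of the error matrix by how strongly the corresponding input channels are activated, which is one way to read the role of the activation-induced scale matrix.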