SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

Post-training quantization (PTQ) has become a prevailing technique for deploying large language models (LLMs) with low memory and compute cost on both edge devices and server platforms. Existing PTQ methods reduce the precision of weights and activations while mitigating the quantization errors caused by channel-wise outlier activations, using techniques such as pre-quantization scaling, online transformations, or low-rank error reconstruction. Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies report severe accuracy degradation under W4A4 (4-bit weights, 4-bit activations) settings, and conventional low-rank adaptations rely on two sequential factors, which necessitates intermediate quantization during inference and thereby limits low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating the quantization errors that arise from both activation and weight saliency, in three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The only additional online computation is the low-rank error reconstruction, which uses a single decomposition; all other operations are performed offline, which keeps latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches while substantially reducing calibration complexity.
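
The contrast between two sequential low-rank factors and a single compensation matrix can be illustrated with a small NumPy sketch. The abstract does not spell out how SERQ builds its matrix, so everything below is an assumption made for illustration: the function names, the saliency score (activation magnitude times error row norm), and the choice to keep the error rows of the most salient input channels are hypothetical and are not taken from the paper. Activations stay in floating point here; only the weights are fake-quantized.

```python
import numpy as np


def quantize_sym(w, bits=4):
    """Symmetric round-to-nearest fake quantization, one scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero columns
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def two_factor_path(err, rank):
    """Conventional low-rank path: err ~= a @ b, applied as (x @ a) @ b."""
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def single_matrix_path(err, act_scale, rank):
    """Hypothetical single-matrix path: keep the error rows of the most salient input channels."""
    saliency = act_scale * np.linalg.norm(err, axis=1)  # assumed saliency score
    idx = np.argsort(saliency)[::-1][:rank]
    return idx, err[idx]


def compare_paths(seed=0, d_in=64, d_out=32, n=16, rank=8):
    """Return output errors (quantized only, two-factor, single-matrix) on toy data."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(d_in, d_out))
    x = rng.normal(size=(n, d_in))
    x[:, :4] *= 20.0  # inject channel-wise outlier activations

    w_q = quantize_sym(w)
    err = w - w_q  # weight quantization error to be compensated
    y_ref, y_q = x @ w, x @ w_q

    a, b = two_factor_path(err, rank)
    # (x @ a) is the intermediate that a low-bit kernel would have to re-quantize.
    y_two = y_q + (x @ a) @ b

    idx, e_r = single_matrix_path(err, np.abs(x).max(axis=0), rank)
    # One extra matmul on a slice of the input, with no intermediate tensor.
    y_single = y_q + x[:, idx] @ e_r

    norm = np.linalg.norm
    return norm(y_ref - y_q), norm(y_ref - y_two), norm(y_ref - y_single)
```

Calling `compare_paths()` returns the Frobenius-norm output error of the quantized layer alone, with the two-factor path, and with the single-matrix path. In this reading, the single matrix multiplies a slice of the layer input directly, which is why no intermediate quantization step appears; permuting the salient channels to be contiguous offline would turn the gather `x[:, idx]` into a plain slice, which is one plausible role for the weight permutation stage.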
