Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective at reducing the memory footprint and improving the computational efficiency of LLMs, they rely on hand-crafted quantization parameters, which leads to low performance and fails under extremely low-bit quantization. To tackle this issue, we introduce Omnidirectionally calibrated Quantization (OmniQuant), a technique for LLMs that achieves good performance across diverse quantization settings while retaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two components: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. LET tackles activation outliers by shifting the difficulty of quantization from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework based on block-wise error minimization, OmniQuant optimizes the quantization process efficiently for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family (7B to 70B parameters) can be processed with OmniQuant on a single A100-40G GPU within 1 to 16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant is effective on instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Code and models are available at \url{https://github.com/OpenGVLab/OmniQuant}.
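To make the two components concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes per-tensor asymmetric uniform quantization, a single shared clipping factor, and a grid search standing in for the gradient-based optimization described above. The function names (`quantize_weights`, `search_clipping`, `equivalent_transform`) and the square-root scaling heuristic are hypothetical choices made for illustration only.

```python
import numpy as np


def quantize_weights(w, n_bits, gamma=1.0, beta=1.0):
    """Fake-quantize w on a uniform grid over the clipped range
    [beta * min(w), gamma * max(w)]. Assumes min(w) < 0 < max(w)."""
    qmax = 2 ** n_bits - 1
    w_max = gamma * w.max()
    w_min = beta * w.min()
    scale = (w_max - w_min) / qmax
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale  # dequantized weights


def search_clipping(w, x, n_bits, grid=None):
    """Choose the clipping factor that minimizes the layer output error.
    Assumption: a grid search replaces the learned, differentiable update."""
    if grid is None:
        grid = np.linspace(0.3, 1.0, 15)  # includes 1.0, i.e. no clipping
    ref = x @ w
    best_factor, best_err = 1.0, np.inf
    for g in grid:
        err = np.mean((ref - x @ quantize_weights(w, n_bits, g, g)) ** 2)
        if err < best_err:
            best_factor, best_err = float(g), float(err)
    return best_factor, best_err


def equivalent_transform(x, w, s):
    """Divide each activation channel by s and multiply the matching
    weight row by s, so that (x / s) @ (s * w) equals x @ w."""
    return x / s, s[:, None] * w


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))
w[0, 0] = 8.0                      # a weight outlier stretches the range
x = rng.normal(size=(64, 16))
x[:, 3] *= 50.0                    # an activation outlier channel

factor, err = search_clipping(w, x, n_bits=3)

# Hypothetical fixed scaling; in the method described above it is learned.
s = np.sqrt(np.abs(x).max(axis=0))
x_t, w_t = equivalent_transform(x, w, s)
print(factor, err, np.abs(x).max(), np.abs(x_t).max())
```

Because the search grid contains the factor 1.0, the selected clipping factor can never yield a larger output error than unclipped quantization on the same calibration data. The transformation leaves the layer output unchanged while shrinking the dynamic range of the outlier activation channel, which is the sense in which quantization difficulty moves from activations to weights.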