Large language models (LLMs) with hundreds of billions of parameters show impressive results across a wide range of language tasks using only simple prompt tuning and few-shot examples, without task-specific fine-tuning. However, their enormous size requires multiple server-grade GPUs even for inference, creating a significant cost barrier. To address this limitation, we introduce a novel post-training quantization (PTQ) method for weights that incurs minimal quality degradation. Activation outliers are known to be problematic for activation quantization; our theoretical analysis suggests that they also help identify the factors that contribute to weight quantization error. We propose a PTQ scheme called outlier-aware weight quantization (OWQ), which identifies vulnerable weights and allocates high precision to them. Our extensive experiments demonstrate that the 3.01-bit models produced by OWQ exhibit quality comparable to the 4-bit models generated by OPTQ.
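The idea of keeping a few vulnerable weights in high precision can be illustrated with a minimal NumPy sketch. It is not the OWQ implementation. It rests on assumptions that do not come from the text above: sensitivity is scored per input channel as the activation energy times the squared column norm of the weights, the remaining columns use plain round-to-nearest min-max quantization, and the function names (`quantize_minmax`, `outlier_aware_quantize`, `effective_bits`) and layer sizes are hypothetical.

```python
import numpy as np


def quantize_minmax(w, bits):
    """Round-to-nearest uniform quantization with one min-max range per output row.

    Returns the de-quantized (floating-point) weights so errors can be compared directly.
    """
    levels = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-12) / levels
    q = np.clip(np.round((w - w_min) / scale), 0, levels)
    return q * scale + w_min


def outlier_aware_quantize(w, x, bits=3, n_keep=1):
    """Quantize w (out_features, in_features), keeping n_keep input columns untouched.

    x holds calibration activations with shape (samples, in_features).
    ASSUMPTION: the sensitivity score below is an illustrative proxy,
    not a criterion taken from the source text.
    """
    act_energy = (x ** 2).sum(axis=0)                 # per-channel activation energy
    sensitivity = act_energy * (w ** 2).sum(axis=0)   # large activations amplify weight error
    keep = np.argsort(sensitivity)[::-1][:n_keep]     # most sensitive columns first

    low_precision = np.ones(w.shape[1], dtype=bool)
    low_precision[keep] = False

    w_hat = w.copy()                                  # kept columns stay in full precision
    if low_precision.any():
        w_hat[:, low_precision] = quantize_minmax(w[:, low_precision], bits)
    return w_hat, keep


def effective_bits(in_features, n_keep, bits=3, high_bits=16):
    """Average bits per weight, ignoring storage for scales, offsets, and column indices."""
    return (bits * (in_features - n_keep) + high_bits * n_keep) / in_features


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16))
    x = rng.normal(size=(64, 16))
    x[:, 5] *= 50.0                                   # synthetic activation outlier channel

    w_hat, keep = outlier_aware_quantize(w, x, bits=3, n_keep=1)
    plain = quantize_minmax(w, 3)
    print("kept columns:", keep)
    print("output error, outlier-aware:", np.linalg.norm(x @ (w - w_hat).T))
    print("output error, plain 3-bit:  ", np.linalg.norm(x @ (w - plain).T))
    print("average bits per weight:", effective_bits(4096, 1))
```

Under these assumptions, the weights that multiply an outlier activation channel dominate the layer's output error, so protecting even one such column should reduce the error noticeably. The average bit width rises only slightly above the base precision, which is how a fractional figure slightly above 3 bits can arise.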