This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/jerry-chee/QuIP .
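To make the two-step structure concrete, the following is a minimal NumPy sketch of incoherence pre- and post-processing wrapped around a rounding step. It rests on several assumptions that are not taken from the abstract: the orthogonal matrices are dense and drawn via a QR factorization (the efficient construction used by QuIP is not reproduced here), plain nearest rounding onto a uniform grid stands in for the adaptive rounding procedure, and the quadratic proxy is taken to be tr((Ŵ − W) H (Ŵ − W)ᵀ). All function names are hypothetical and chosen for illustration only.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))


def proxy_loss(w_hat, w, h):
    """Assumed quadratic proxy: tr((W_hat - W) H (W_hat - W)^T)."""
    d = w_hat - w
    return float(np.trace(d @ h @ d.T))


def round_to_grid(w, bits):
    """Nearest rounding onto a uniform grid with 2**bits levels.

    Stand-in only: this is NOT the adaptive rounding procedure of QuIP.
    """
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:  # constant matrix, nothing to round
        return w.copy()
    return np.round((w - lo) / scale) * scale + lo


def incoherence_preprocess(w, h, rng):
    """Rotate weights (m x n) and Hessian (n x n) by random orthogonal U, V."""
    m, n = w.shape
    u, v = random_orthogonal(m, rng), random_orthogonal(n, rng)
    return u @ w @ v.T, v @ h @ v.T, u, v


def incoherence_postprocess(w_rot_hat, u, v):
    """Undo the rotation applied to the (now quantized) weights."""
    return u.T @ w_rot_hat @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 16))
    w[0, 0] = 25.0                      # an outlier aligned with the axes
    x = rng.standard_normal((64, 16))   # hypothetical calibration inputs
    h = x.T @ x                         # simple Hessian stand-in

    w_rot, h_rot, u, v = incoherence_preprocess(w, h, rng)
    w_hat = incoherence_postprocess(round_to_grid(w_rot, 2), u, v)

    print("largest |entry| before rotation:", np.abs(w).max())
    print("largest |entry| after rotation: ", np.abs(w_rot).max())
    print("proxy loss with rotation:   ", proxy_loss(w_hat, w, h))
    print("proxy loss without rotation:", proxy_loss(round_to_grid(w, 2), w, h))
```

Because U and V are orthogonal, the proxy loss measured in the rotated coordinates equals the loss measured on the recovered weights, so the rounding step can operate entirely on the rotated matrices. The sketch prints both losses for comparison but makes no claim about which is smaller for a particular random draw.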