As Large Language Models (LLMs) have grown in popularity, so has interest in compression techniques that enable their efficient deployment. This study focuses on Post-Training Quantization (PTQ) of LLMs. Drawing on recent advances, we introduce QuantEase, a layer-wise quantization framework in which each layer is quantized separately. We frame the problem as a discrete-structured non-convex optimization problem and develop algorithms based on Coordinate Descent (CD). These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates that rely solely on matrix and vector operations, avoiding matrix inversion or decomposition. We also explore an outlier-aware variant of our approach that retains significant weights (outliers) in full precision. In empirical evaluations across various LLMs and datasets, our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy, with relative improvements of up to 15% over methods such as GPTQ. Leveraging careful linear algebra optimizations, QuantEase can quantize models like Falcon-180B on a single NVIDIA A100 GPU in $\sim$3 hours. Particularly noteworthy is the outlier-aware algorithm's ability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, without resorting to non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.
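To make the coordinate-descent idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the layer-wise objective is the squared reconstruction error $\|WX - \hat{W}X\|_F^2$ commonly used in layer-wise PTQ (the abstract does not state the objective), and it assumes a uniform per-row quantization grid. The names `make_grid`, `quantize`, `layer_error`, and `cd_quantize` are hypothetical. Under these assumptions, each coordinate's one-dimensional problem is a quadratic, so its minimizer on the grid is the grid point nearest to the unconstrained minimizer, and every update uses only matrix and vector operations.

```python
import numpy as np


def make_grid(W, bits):
    """Assumed per-row uniform grid: levels are lo + k * scale, k = 0..2^bits - 1."""
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # constant rows: every entry already sits on level 0
    return lo, scale


def quantize(V, lo, scale, bits):
    """Map each entry of V to the nearest level of its row's grid."""
    k = np.clip(np.round((V - lo) / scale), 0, 2**bits - 1)
    return lo + k * scale


def layer_error(W, W_hat, Sigma):
    """||(W_hat - W) X||_F^2 computed through Sigma = X X^T."""
    D = W_hat - W
    return float(np.sum((D @ Sigma) * D))


def cd_quantize(W, X, bits=3, n_iters=3):
    """Cyclic coordinate descent over input columns (a sketch under the
    assumptions stated above).

    W: (out_features, in_features) weights; X: (in_features, n_samples) inputs.
    """
    Sigma = X @ X.T
    lo, scale = make_grid(W, bits)
    W_hat = quantize(W, lo, scale, bits)   # start from round-to-nearest
    G = (W_hat - W) @ Sigma                # half the gradient of the objective
    lo_row, scale_row = lo[:, 0], scale[:, 0]

    for _ in range(n_iters):
        for j in range(W.shape[1]):
            s_jj = Sigma[j, j]
            if s_jj <= 0:                  # input feature never active: skip
                continue
            # Unconstrained 1-D minimizer for column j, for all rows at once.
            beta = W_hat[:, j] - G[:, j] / s_jj
            new_col = quantize(beta, lo_row, scale_row, bits)
            delta = new_col - W_hat[:, j]
            W_hat[:, j] = new_col
            # Rank-one update keeps G consistent without recomputing it.
            G += np.outer(delta, Sigma[j, :])
    return W_hat
```

Because the sketch starts from round-to-nearest and each column update exactly minimizes the one-dimensional objective over the grid, the reconstruction error it returns should not exceed the round-to-nearest error, up to floating-point rounding. Calling `cd_quantize(W, X, bits=3)` on a random weight matrix and calibration batch, then comparing `layer_error` against the round-to-nearest baseline, is a quick way to check this. The outlier-aware variant mentioned in the abstract is not covered here.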