With the rising popularity of Large Language Models (LLMs), interest has grown in compression techniques that enable their efficient deployment. This study focuses on Post-Training Quantization (PTQ) of LLMs. Drawing on recent advances, we introduce QuantEase, a layer-wise quantization framework in which each layer is quantized separately. We frame the problem as a discrete-structured non-convex optimization problem and develop algorithms based on Coordinate Descent (CD) to solve it. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, the CD updates are straightforward: they rely solely on matrix and vector operations and require no matrix inversion or decomposition. We also explore an outlier-aware variant of our approach that retains significant weights (outliers) in full precision. In empirical evaluations across various LLMs and datasets, our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy, with relative improvements of up to 15% over methods such as GPTQ. Particularly noteworthy is that our outlier-aware algorithm achieves near-3-bit or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, without resorting to non-uniform quantization or grouping techniques, and improves upon methods such as SpQR by up to a factor of two in perplexity.
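To make the coordinate-descent idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the layer-wise objective is the reconstruction error ||WX - ŴX||_F^2 for a weight matrix W (outputs by inputs) and calibration inputs X (inputs by samples), and that each row of Ŵ is restricted to a uniform min-max grid; the abstract specifies neither. All function names (`make_grid`, `round_to_grid`, `layer_error`, `cd_quantize`) are hypothetical.

```python
import numpy as np

def make_grid(W, bits):
    # Assumption: one uniform min-max grid per output row (per-channel).
    lo = W.min(axis=1, keepdims=True)
    hi = W.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum(hi - lo, 1e-12) / levels
    return lo, scale, levels

def round_to_grid(V, lo, scale, levels):
    # Map each value to the nearest point of its row's grid.
    q = np.clip(np.round((V - lo) / scale), 0, levels)
    return lo + q * scale

def layer_error(W, W_hat, sigma):
    # ||(W_hat - W) X||_F^2 written in terms of sigma = X X^T.
    D = W_hat - W
    return float(np.sum((D @ sigma) * D))

def cd_quantize(W, X, bits=3, n_sweeps=5):
    sigma = X @ X.T                          # input second-moment matrix
    diag = np.maximum(np.diag(sigma), 1e-12)  # guard against dead inputs
    lo, scale, levels = make_grid(W, bits)
    W_hat = round_to_grid(W, lo, scale, levels)  # round-to-nearest start
    D = W_hat - W
    for _ in range(n_sweeps):
        for j in range(W.shape[1]):
            # Effect of all other coordinates on column j; rows are
            # independent in the objective, so they update together.
            g = D @ sigma[:, j] - D[:, j] * sigma[j, j]
            # Unconstrained 1-D minimizer, then nearest grid point, which
            # is the exact minimizer of a 1-D quadratic over the grid.
            target = W[:, j] - g / diag[j]
            new_col = round_to_grid(target[:, None], lo, scale, levels)[:, 0]
            D[:, j] = new_col - W[:, j]
            W_hat[:, j] = new_col
    return W_hat
```

Each update uses only a matrix-vector product and an element-wise rounding step, so no inverse or factorization of `sigma` is needed. Because every coordinate update exactly minimizes a one-dimensional quadratic over the grid, the layer error never increases (up to floating-point rounding) relative to the round-to-nearest starting point.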