Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available at https://github.com/xzhang9308/GLVQ.
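The quantize/decode mechanism described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: it assumes a square, invertible generation matrix `G` whose columns generate the lattice, uses Babai's simple rounding (round the lattice coordinates of a weight vector) to approximate the nearest lattice point, and decodes with a single matrix-vector product. All function names here are illustrative.

```python
import numpy as np

def babai_round(G, x):
    """Approximate nearest-lattice-point search via Babai rounding.

    G : (n, n) generation matrix; its columns generate the lattice.
    x : (n,)  weight vector to quantize.
    Returns the integer coordinate vector z such that G @ z is a
    lattice point close to x (exact nearest point is not guaranteed).
    """
    # Coordinates of x in the lattice basis, rounded to integers.
    z = np.rint(np.linalg.solve(G, x))
    return z

def decode(G, z):
    """Decoding is just a matrix-vector multiplication."""
    return G @ z

# Toy example: quantize a small weight vector with a 2-D lattice.
G = np.array([[2.0, 1.0],
              [0.0, 1.0]])
x = np.array([2.4, 0.9])
z = babai_round(G, x)      # integer codes stored at low bit-width
x_hat = decode(G, z)       # dequantized approximation of x
```

During training, `G` would be optimized per weight group; only the integer codes `z` (plus the small per-group matrix `G`) need to be stored.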