The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm

From arXiv. Published as a conference paper at the Fourteenth International Conference on Learning Representations (ICLR 2026): https://openreview.net/forum?id=NFB4QGGS65

Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidths is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure any geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to the first dimension) for a linear layer, GPTQ is mathematically identical to Babai's nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer's inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai's algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at https://github.com/IST-DASLab/GPTQ-Babai.
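
To make the CVP view concrete, the following is a minimal NumPy sketch of Babai's nearest plane algorithm applied to a toy quantization problem. It is an illustration under stated assumptions, not the code from the linked repository: the function name `babai_nearest_plane`, the random calibration matrix `x`, the weight vector `w`, and the uniform grid step `step` are all hypothetical, and the lattice basis is assumed to be `x * step`, so that its Gram matrix is the layer Hessian `x.T @ x` scaled by `step ** 2`. No clipping is applied, which matches the assumption under which the error bound is stated.

```python
import numpy as np


def babai_nearest_plane(basis, target):
    """Return integer coefficients z with basis @ z close to target.

    basis:  (m, n) array with linearly independent columns (lattice basis).
    target: (m,) array, the point to approximate by a lattice point.
    Also returns the upper-triangular factor R of the QR decomposition.
    """
    # basis = Q @ R, with Q orthonormal and R upper triangular.
    q, r = np.linalg.qr(basis)
    y = q.T @ target
    n = basis.shape[1]
    z = np.zeros(n, dtype=np.int64)
    # Fix coordinates from the last dimension to the first (back-to-front).
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of the coordinates already rounded.
        residual = y[i] - r[i, i + 1:] @ z[i + 1:]
        z[i] = int(np.rint(residual / r[i, i]))
    return z, r


# Toy problem (all values are hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))  # calibration inputs, one row per sample
w = rng.standard_normal(8)        # full-precision weights of one output unit
step = 0.25                       # uniform grid step, unbounded grid

z, r = babai_nearest_plane(x * step, x @ w)
w_q = step * z                    # quantized weights on the grid

# Layer-output error and the nearest-plane bound (1/4) * sum_i R_ii^2.
err = np.sum((x @ (w - w_q)) ** 2)
bound = 0.25 * np.sum(np.diag(r) ** 2)
print(f"output error = {err:.4f}, nearest-plane bound = {bound:.4f}")
```

Each rounding step leaves a residual of at most half of `|R_ii|` along the corresponding orthogonal direction, which is where the bound in the last lines comes from. The sketch uses a QR decomposition of the inputs for readability; working from a factorization of the Hessian instead is a design choice this example does not explore.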
