Adaptive rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet dense, element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, VQRound minimizes the element-wise worst-case error under the $L_\infty$ norm, which is critical for handling the heavy-tailed weight distributions of LLMs. Beyond reparameterization, we identify rounding initialization as a decisive factor and develop a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers using only 128 calibration samples. Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models demonstrate that VQRound converges faster than traditional adaptive rounding at the same number of steps while using as little as 0.2% of the trainable parameters. Our results show that adaptive rounding can be made both scalable and fast to fit. The code is available at https://github.com/zhoustan/VQRound.
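The core idea of reparameterizing a dense rounding matrix into a small shared codebook can be illustrated with a toy numpy sketch. This is not the authors' implementation: the 4-bit symmetric scale, quantile-based codebook initialization, and nearest-entry assignment below are illustrative assumptions, used only to show how per-element rounding decisions collapse into a handful of trainable codebook entries.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))      # toy weight matrix
s = np.abs(W).max() / 7                # assumed 4-bit symmetric scale (levels -7..7)

# Round-to-nearest baseline: floor plus a per-element 0/1 rounding decision.
rtn = np.round(W / s)

# Adaptive rounding normally learns that 0/1 decision per element (a dense
# matrix the size of W). A codebook reparameterization in the spirit of
# VQRound stores only K shared entries plus fixed assignment indices.
K = 16
frac = W / s - np.floor(W / s)         # fractional parts in [0, 1)

# Illustrative initialization: codebook entries at uniform quantiles of frac.
codebook = np.quantile(frac, np.linspace(0, 1, K))
# Fix each element's assignment to its nearest codebook entry; after this,
# only the K codebook values would remain trainable.
idx = np.abs(frac[..., None] - codebook).argmin(-1)

# Reconstruct the rounding decision from the codebook and apply it.
h = (codebook[idx] >= 0.5).astype(W.dtype)
adaptive = np.floor(W / s) + h

dense_params = W.size                  # per-element rounding matrix
vq_params = K                          # trainable codebook entries only
print(f"dense rounding params: {dense_params}, codebook params: {vq_params}")
```

Even in this toy setting the codebook reconstruction agrees with RTN on the vast majority of elements while exposing only K trainable values, which is the kind of parameter reduction the abstract's 0.2% figure refers to.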