Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with a uniform quantization representation is favored for its efficiency and ease of deployment, as uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on low-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they mainly focus on quantization methodologies, while the initialization of quantization parameters remains underexplored and still relies on the conventional Min-Max formula. In this work, we identify the limitations of the Min-Max formula, move beyond its constraints, and propose NeUQI, a method that efficiently determines a near-optimal initialization for uniform quantization. NeUQI simplifies the joint optimization of the scale and zero-point by deriving the zero-point for a given scale, thereby reducing the problem to a scale-only optimization. Benefiting from the improved quantization parameters, NeUQI consistently outperforms existing methods in experiments with the LLaMA and Qwen families across various settings and tasks. Furthermore, when combined with a lightweight distillation strategy, NeUQI even achieves superior performance to PV-tuning, a considerably more resource-intensive method.
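To make the scale/zero-point reduction concrete, the sketch below illustrates uniform affine quantization, the conventional Min-Max initialization the abstract contrasts against, and a scale-only search in which the zero-point is chosen per candidate scale. This is a minimal illustrative sketch, not the paper's actual algorithm: NeUQI derives the zero-point for a given scale, whereas here we approximate that step with a brute-force scan over integer zero-points, and the candidate scale grid is an arbitrary choice for demonstration.

```python
import numpy as np

def quantize(w, scale, zero, bits=4):
    # Uniform affine quantization: q = clip(round(w/scale) + zero, 0, 2^bits - 1),
    # then dequantize back to the real domain as scale * (q - zero).
    qmax = 2 ** bits - 1
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return scale * (q - zero)

def minmax_init(w, bits=4):
    # Conventional Min-Max initialization: span the full [min, max] range.
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zero = np.clip(np.round(-w.min() / scale), 0, qmax)
    return scale, zero

def best_zero_for_scale(w, scale, bits=4):
    # Stand-in for the derived zero-point: scan all integer zero-points
    # and pick the one minimizing reconstruction MSE for this fixed scale.
    zeros = np.arange(0, 2 ** bits)
    errs = [np.mean((w - quantize(w, scale, z, bits)) ** 2) for z in zeros]
    return zeros[int(np.argmin(errs))]

def scale_only_search(w, bits=4, n_cand=50):
    # With the zero-point determined per scale, the joint (scale, zero)
    # problem collapses to a 1-D search over the scale alone.
    s0, _ = minmax_init(w, bits)
    candidates = np.append(np.linspace(0.5 * s0, 1.2 * s0, n_cand), s0)
    best = (np.inf, s0, 0)
    for s in candidates:
        z = best_zero_for_scale(w, s, bits)
        err = np.mean((w - quantize(w, s, z, bits)) ** 2)
        if err < best[0]:
            best = (err, s, z)
    return best  # (mse, scale, zero)
```

On typical weight-like (roughly Gaussian) data, the scale-only search finds a (scale, zero) pair with reconstruction error no worse than, and usually below, the Min-Max initialization, which is the gap the abstract attributes to better-initialized quantization parameters.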