Large language models (LLMs) deliver strong performance across diverse applications, yet their deployment is often hampered by the memory and latency costs of storing and accessing billions of parameters. Post-training quantization (PTQ) enables efficient inference by mapping pretrained weights to low-bit formats without retraining, but its effectiveness depends critically on both the quantization objective and the rounding procedure used to obtain low-bit weight representations. In this work, we show that interpolating between symmetric and asymmetric calibration acts as a form of regularization that preserves the standard quadratic structure used in PTQ while providing robustness to activation mismatch. Building on this perspective, we derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, as well as a bounded-search extension that allows an explicit trade-off between quantization quality and compute cost. Experiments across multiple LLM families, quantization bit-widths, and benchmarks demonstrate that the proposed bounded search, based on a regularized asymmetric calibration objective, consistently improves perplexity and accuracy over PTQ baselines while incurring only a modest and controllable additional computational cost.
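The two ideas in the abstract can be sketched concretely for a single weight row. The sketch below is a minimal illustration under stated assumptions, not the paper's actual algorithm: it assumes the regularized objective is a convex combination of a symmetric and an asymmetric calibration Hessian (weight `lam` is hypothetical), performs successive coordinate-wise rounding against that quadratic objective, and then runs a bounded search over single-level flips whose `budget` parameter caps the extra compute.

```python
import numpy as np

def quantize_weights(w, H_sym, H_asym, scale, lam=0.5, budget=8):
    """Illustrative sketch (assumed formulation, not the paper's method):
    regularized quadratic PTQ objective + successive rounding + bounded search.

    w       : 1-D array of pretrained weights for one row
    H_sym   : calibration Hessian from the symmetric objective (PSD)
    H_asym  : calibration Hessian from the asymmetric objective (PSD)
    scale   : quantization step size
    lam     : interpolation weight between the two calibrations
    budget  : max number of refinement passes in the bounded search
    """
    # Regularization via interpolation: convex combination of the two
    # calibration Hessians keeps the objective quadratic in the weights.
    H = (1.0 - lam) * H_asym + lam * H_sym

    def loss(q):
        # Quadratic quantization error e^T H e with e = w - scale * q.
        e = w - scale * q
        return float(e @ H @ e)

    # Successive rounding: fix coordinates one at a time, choosing floor or
    # ceil to minimize the objective; later coordinates stay continuous
    # (zero error) until their turn.
    q = w / scale
    for i in range(len(w)):
        lo, hi = np.floor(q[i]), np.ceil(q[i])
        best_c, best_l = lo, np.inf
        for c in (lo, hi):
            q[i] = c
            l = loss(q)
            if l < best_l:
                best_c, best_l = c, l
        q[i] = best_c

    # Bounded search: greedy +/-1 level flips, at most `budget` passes,
    # giving an explicit quality-vs-compute trade-off.
    best = loss(q)
    for _ in range(budget):
        improved = False
        for i in range(len(w)):
            for d in (-1.0, 1.0):
                q[i] += d
                l = loss(q)
                if l < best - 1e-12:
                    best, improved = l, True
                else:
                    q[i] -= d  # revert non-improving flip
        if not improved:
            break
    return q.astype(int), best
```

With identical diagonal Hessians (no cross-coordinate coupling), the procedure reduces to round-to-nearest, which is a useful sanity check; the interesting behavior appears once the asymmetric Hessian introduces off-diagonal terms.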