Post-training quantization (PTQ) is a cornerstone for efficiently deploying large language models (LLMs), where a small calibration set critically affects quantization performance. However, conventional practice relies on random sequences of fixed length, overlooking the variable-length nature of LLM inputs. Input length directly influences the activation distribution and, consequently, the weight importance captured by the Hessian, which in turn affects quantization outcomes. As a result, Hessian estimates derived from fixed-length calibration may fail to represent the true importance of weights across diverse input scenarios. We propose MaCa (Matryoshka Calibration), a simple yet effective method for length-aware Hessian construction. MaCa (i) incorporates multi-scale sequence-length information into Hessian estimation and (ii) normalizes each sequence as an independent sample, yielding a more stable and informative Hessian for accurate quantization. Experiments on state-of-the-art LLMs (e.g., Qwen3, Gemma3, LLaMA3) demonstrate that MaCa consistently improves accuracy under low-bit quantization, offering a lightweight enhancement compatible with existing PTQ frameworks. To the best of our knowledge, this is the first work to systematically highlight the role of multi-scale calibration in LLM quantization.
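The two ingredients described above can be sketched as follows. This is a minimal, hedged illustration assuming a GPTQ-style calibration Hessian of the form H ≈ Σ XᵀX; the function name, the specific nested prefix lengths, and the per-sequence normalization are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (not the paper's exact method): accumulate a
# calibration Hessian over nested ("Matryoshka") prefix lengths, treating
# each truncated sequence as one normalized sample.
import numpy as np

def multiscale_hessian(activations, scales=(128, 256, 512)):
    """activations: list of arrays, each (seq_len, hidden_dim).
    scales: assumed nested prefix lengths (multi-scale truncations).
    Returns an averaged Hessian estimate of shape (hidden_dim, hidden_dim)."""
    d = activations[0].shape[1]
    H = np.zeros((d, d))
    n_samples = 0
    for X in activations:
        for L in scales:
            Xs = X[: min(L, X.shape[0])]   # prefix of the sequence at this scale
            # Divide by the prefix length so every truncated sequence
            # contributes as a single sample, independent of its length.
            H += (Xs.T @ Xs) / Xs.shape[0]
            n_samples += 1
    return H / n_samples
```

Because each length scale contributes a length-normalized outer-product term, short- and long-context activation statistics are mixed into one Hessian rather than being dominated by a single fixed calibration length.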