Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and generality of calibration data remain a core bottleneck that determines the accuracy of quantization parameters. Traditional PTQ methods typically rely on a limited set of samples, which struggle to capture the activation distributions seen at inference time and thus bias the quantization parameters. To address this, we propose \textbf{FAQ} (Family-Aware Quantization), a calibration-data regeneration framework that leverages prior knowledge from LLMs of the same family to generate high-fidelity calibration samples. Specifically, FAQ first feeds the original calibration samples into a larger LLM from the same family as the target model, which, drawing on a highly consistent knowledge system, regenerates high-fidelity calibration data. These regenerated samples, which carry Chain-of-Thought reasoning and conform to the expected activation distribution, then undergo expert-guided group competition to select the best candidates, which are re-normalized to strengthen standard PTQ. Experiments on multiple model families, including Qwen3-8B, show that FAQ reduces accuracy loss by up to 28.5\% compared to a baseline using the original calibration data, demonstrating its effectiveness.