As a neural network compression technique, post-training quantization (PTQ) transforms a pre-trained model into a quantized model that uses a lower-precision data type. However, quantization noise reduces prediction accuracy, especially in extremely low-bit settings. Determining appropriate quantization parameters (e.g., scaling factors and the rounding of weights) is the main open problem. Many existing methods determine the quantization parameters by minimizing the distance between features before and after quantization. Using this distance as the optimization metric considers only local information. We analyze the minimization of such local metrics and indicate that it does not result in optimal quantization parameters. Furthermore, the quantized model suffers from overfitting because PTQ uses only a small number of calibration samples. In this paper, we propose PD-Quant to address both problems. PD-Quant uses the difference between the network's predictions before and after quantization to determine the quantization parameters. To mitigate overfitting, PD-Quant adjusts the distribution of activations in PTQ. Experiments show that PD-Quant leads to better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.08% and RegNetX-600MF up to 40.92% with 2-bit weights and 2-bit activations. The code will be released at https://github.com/hustvl/PD-Quant.
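The contrast between a local metric and a prediction-difference metric can be illustrated with a toy sketch. This is not the PD-Quant implementation. It assumes a hypothetical linear classifier head, a grid search over candidate scaling factors, mean squared error as the local feature distance, and KL divergence between softmax outputs as the prediction difference (the abstract does not specify the measure). All names and numbers below are illustrative.

```python
import math

def quantize(values, scale, bits):
    """Uniform symmetric fake-quantization: round to the grid, clamp, rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [scale * max(qmin, min(qmax, round(v / scale))) for v in values]

def mse(a, b):
    """Local metric: mean squared distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q); assumed here as the prediction-difference measure."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def head(features, weights):
    """Hypothetical linear classifier head mapping features to logits."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

# Illustrative data: one activation vector and a 3-class head.
features = [0.9, -0.2, 0.4, 1.6]
weights = [
    [1.0, 0.5, -1.0, 0.2],
    [-0.5, 1.0, 0.3, 0.8],
    [0.2, -0.7, 0.9, -0.4],
]
candidates = [0.2 * k for k in range(1, 11)]  # candidate scaling factors
fp_probs = softmax(head(features, weights))   # full-precision prediction

def local_loss(scale):
    # Distance between the features before and after quantization.
    return mse(features, quantize(features, scale, bits=2))

def prediction_loss(scale):
    # Difference between the predictions before and after quantization.
    q_probs = softmax(head(quantize(features, scale, bits=2), weights))
    return kl_divergence(fp_probs, q_probs)

best_local = min(candidates, key=local_loss)
best_pred = min(candidates, key=prediction_loss)
print(f"scale chosen by local MSE:             {best_local:.1f}")
print(f"scale chosen by prediction difference: {best_pred:.1f}")
```

The two criteria need not select the same scaling factor, because the feature distance ignores how later layers weight each feature. Whether they differ on this toy input depends on the chosen numbers.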