Recently, post-training quantization (PTQ) has drawn much attention as a way to produce efficient neural networks without lengthy retraining. Despite its low cost, current PTQ methods tend to fail under extremely low-bit settings. In this study, we are the first to confirm that properly incorporating activation quantization into the PTQ reconstruction benefits the final accuracy. To understand the underlying reason, we establish a theoretical framework, which indicates that the flatness of the optimized low-bit model on calibration and test data is crucial. Based on this conclusion, we propose a simple yet effective approach dubbed QDROP, which randomly drops the quantization of activations during PTQ. Extensive experiments on various tasks, including computer vision (image classification, object detection) and natural language processing (text classification and question answering), demonstrate its superiority. With QDROP, the limit of PTQ is pushed to 2-bit activations for the first time, and the accuracy boost can be up to 51.49%. Without bells and whistles, QDROP establishes a new state of the art for PTQ. Our code is available at https://github.com/wimh966/QDrop and has been integrated into MQBench (https://github.com/ModelTC/MQBench).
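To make the idea of randomly dropping activation quantization concrete, the following is a minimal sketch and not the authors' implementation (for that, see the repository linked above). It assumes several details the abstract does not specify: the drop is applied element-wise, the drop probability `drop_prob` is a free parameter, the quantizer is a uniform min/max fake quantizer, and activations are plain Python lists. The names `fake_quantize` and `qdrop_activations` are hypothetical.

```python
import random


def fake_quantize(x, num_bits, x_min, x_max):
    """Uniform fake quantization: snap x to a num_bits grid on [x_min, x_max]."""
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clipped = min(max(x, x_min), x_max)
    return round((clipped - x_min) / scale) * scale + x_min


def qdrop_activations(acts, num_bits, drop_prob, rng, training=True):
    """Quantize each activation, but skip quantization with probability drop_prob.

    Assumption: the drop is element-wise and only active during PTQ
    reconstruction (training=True). At inference every element is quantized.
    """
    x_min, x_max = min(acts), max(acts)
    if x_max == x_min:
        # Degenerate range: nothing to quantize.
        return list(acts)
    out = []
    for a in acts:
        if training and rng.random() < drop_prob:
            out.append(a)  # quantization dropped: keep the full-precision value
        else:
            out.append(fake_quantize(a, num_bits, x_min, x_max))
    return out


rng = random.Random(0)
acts = [0.0, 0.1, 0.5, 0.9, 1.0]
# During reconstruction: a random mix of quantized and full-precision values.
mixed = qdrop_activations(acts, num_bits=2, drop_prob=0.5, rng=rng)
# At inference: all activations are quantized to the 2-bit grid.
deployed = qdrop_activations(acts, num_bits=2, drop_prob=0.5, rng=rng, training=False)
```

In this sketch, setting `drop_prob=0.0` recovers ordinary activation quantization during reconstruction, while `drop_prob=1.0` corresponds to reconstruction without any activation quantization.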