Deep neural networks have proven effective across a wide range of tasks. However, their high computational and memory costs make them impractical to deploy on resource-constrained devices. Quantization schemes have been proposed to address this issue by reducing the memory footprint and improving inference speed. Although numerous quantization methods exist, their effectiveness has not been analyzed systematically. To bridge this gap, we collect and improve existing quantization methods and propose a gold guideline for post-training quantization. We evaluate the proposed method with two popular models, ResNet50 and MobileNetV2, on the ImageNet dataset. When our guidelines are followed, directly quantizing the model to 8 bits without additional training causes no accuracy degradation. Quantization-aware training based on the guidelines can further improve accuracy in lower-bit quantization. Moreover, we integrate a multi-stage fine-tuning strategy that works with existing pruning techniques to reduce costs even further. Remarkably, our results show that a quantized MobileNetV2 with 30\% sparsity surpasses the performance of the equivalent full-precision model, underscoring the effectiveness and resilience of the proposed scheme.